Age | Commit message (Collapse) | Author |
|
Just a small refactor patch in order to improve the code readability.
Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds support for overloading stateful objects operations
through the select_ops() callback, just as it is implemented for
expressions.
This change is needed for upcoming additions to the stateful objects
infrastructure.
Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds a new feature to hashlimit that allows matching on the
current packet/byte rate without rate limiting. This can be enabled
with a new flag --hashlimit-rate-match. The match returns true if the
current rate of packets is above/below the user specified value.
The main difference between the existing algorithm and the new one is
that the existing algorithm rate-limits the flow whereas the new
algorithm does not. Instead it *classifies* the flow based on whether
it is above or below a certain rate. I will demonstrate this with an
example below. Let us assume this rule:
iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain
If the packet rate is 15/s, the existing algorithm would ACCEPT 10
packets every second and send 5 packets to "new_chain".
But with the new algorithm, as long as the rate of 15/s is sustained,
all packets will continue to match and every packet is sent to new_chain.
This new functionality will let us classify different flows based on
their current rate, so that further decisions can be made on them based on
what the current rate is.
This is how the new algorithm works:
We divide time into intervals of 1 (sec/min/hour) as specified by
the user. We keep track of the number of packets/bytes processed in the
current interval. After each interval we reset the counter to 0.
When we receive a packet for match, we look at the packet rate
during the current interval and the previous interval to make a
decision:
if [ prev_rate < user and cur_rate < user ]
return Below
else
return Above
Where cur_rate is the number of packets/bytes seen in the current
interval, prev is the number of packets/bytes seen in the previous
interval and 'user' is the rate specified by the user.
We also provide flexibility to the user for choosing the time
interval using the option --hashilmit-interval. For example the user can
keep a low rate like x/hour but still keep the interval as small as 1
second.
To preserve backwards compatibility we have to add this feature in a new
revision, so I've created revision 3 for hashlimit. The two new options
we add are:
--hashlimit-rate-match
--hashlimit-rate-interval
I have updated the help text to add these new options. Also added a few
tests for the new options.
Suggested-by: Igor Lubashev <ilubashe@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Conflicts:
mm/page_alloc.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:
====================
pull request: bluetooth-next 2017-09-03
Here's one last bluetooth-next pull request for the 4.14 kernel:
- NULL pointer fix in ca8210 802.15.4 driver
- A few "const" fixes
- New Kconfig option for disabling legacy interfaces
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Basically, updates to the conntrack core, enhancements for
nf_tables, conversion of netfilter hooks from linked list to array to
improve memory locality and asorted improvements for the Netfilter
codebase. More specifically, they are:
1) Add expection to hashes after timer initialization to prevent
access from another CPU that walks on the hashes and calls
del_timer(), from Florian Westphal.
2) Don't update nf_tables chain counters from hot path, this is only
used by the x_tables compatibility layer.
3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
Hooks are always guaranteed to run from rcu read side, so remove
nested rcu_read_lock() where possible. Patch from Taehee Yoo.
4) nf_tables new ruleset generation notifications include PID and name
of the process that has updated the ruleset, from Phil Sutter.
5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
the nf_family netdev family. Patch from Pablo M. Bermudo.
6) Add support for nft_fib in nf_tables netdev family, also from Pablo.
7) Use deferrable workqueue for conntrack garbage collection, to reduce
power consumption, from Patch from Subash Abhinov Kasiviswanathan.
8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
Westphal.
9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.
10) Drop references on conntrack removal path when skbuffs has escaped via
nfqueue, from Florian.
11) Don't queue packets to nfqueue with dying conntrack, from Florian.
12) Constify nf_hook_ops structure, from Florian.
13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter.
14) Add nla_strdup(), from Phil Sutter.
15) Rise nf_tables objects name size up to 255 chars, people want to use
DNS names, so increase this according to what RFC 1035 specifies.
Patch series from Phil Sutter.
16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
registration on demand, suggested by Eric Dumazet, patch from Florian.
17) Remove unused variables in compat_copy_entry_from_user both in
ip_tables and arp_tables code. Patch from Taehee Yoo.
18) Constify struct nf_conntrack_l4proto, from Julia Lawall.
19) Constify nf_loginfo structure, also from Julia.
20) Use a single rb root in connlimit, from Taehee Yoo.
21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.
22) Use audit_log() instead of open-coding it, from Geliang Tang.
23) Allow to mangle tcp options via nft_exthdr, from Florian.
24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
a fix for a miscalculation of the minimal length.
25) Simplify branch logic in h323 helper, from Nick Desaulniers.
26) Calculate netlink attribute size for conntrack tuple at compile
time, from Florian.
27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
From Florian.
28) Remove holes in nf_conntrack_l4proto structure, so it becomes
smaller. From Florian.
29) Get rid of print_tuple() indirection for /proc conntrack listing.
Place all the code in net/netfilter/nf_conntrack_standalone.c.
Patch from Florian.
30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
off. From Florian.
31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
Florian.
32) Fix broken indentation in ebtables extensions, from Colin Ian King.
33) Fix several harmless sparse warning, from Florian.
34) Convert netfilter hook infrastructure to use array for better memory
locality, joint work done by Florian and Aaron Conole. Moreover, add
some instrumentation to debug this.
35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
per batch, from Florian.
36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.
37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.
38) Remove unused code in the generic protocol tracker, from Davide
Caratti.
I think I will have material for a second Netfilter batch in my queue if
time allow to make it fit in this merge window.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Using l2tp_tunnel_find() in pppol2tp_session_create() and
l2tp_eth_create() is racy, because no reference is held on the
returned session. These functions are only used to implement the
->session_create callback which is run by l2tp_nl_cmd_session_create().
Therefore searching for the parent tunnel isn't necessary because
l2tp_nl_cmd_session_create() already has a pointer to it and holds a
reference.
This patch modifies ->session_create()'s prototype to directly pass the
the parent tunnel as parameter, thus avoiding searching for it in
pppol2tp_session_create() and l2tp_eth_create().
Since we have to touch the ->session_create() call in
l2tp_nl_cmd_session_create(), let's also remove the useless conditional:
we know that ->session_create isn't NULL at this point because it's
already been checked earlier in this same function.
Finally, one might be tempted to think that the removed
l2tp_tunnel_find() calls were harmless because they would return the
same tunnel as the one held by l2tp_nl_cmd_session_create() anyway.
But that tunnel might be removed and a new one created with same tunnel
Id before the l2tp_tunnel_find() call. In this case l2tp_tunnel_find()
would return the new tunnel which wouldn't be protected by the
reference held by l2tp_nl_cmd_session_create().
Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Fixes: d9e31d17ceba ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
l2tp_tunnel_destruct() sets tunnel->sock to NULL, then removes the
tunnel from the pernet list and finally closes all its sessions.
Therefore, it's possible to add a session to a tunnel that is still
reachable, but for which tunnel->sock has already been reset. This can
make l2tp_session_create() dereference a NULL pointer when calling
sock_hold(tunnel->sock).
This patch adds the .acpt_newsess field to struct l2tp_tunnel, which is
used by l2tp_tunnel_closeall() to prevent addition of new sessions to
tunnels. Resetting tunnel->sock is done after l2tp_tunnel_closeall()
returned, so that l2tp_session_add_to_tunnel() can safely take a
reference on it when .acpt_newsess is true.
The .acpt_newsess field is modified in l2tp_tunnel_closeall(), rather
than in l2tp_tunnel_destruct(), so that it benefits all tunnel removal
mechanisms. E.g. on UDP tunnels, a session could be added to a tunnel
after l2tp_udp_encap_destroy() proceeded. This would prevent the tunnel
from being removed because of the references held by this new session
on the tunnel and its socket. Even though the session could be removed
manually later on, this defeats the purpose of
commit 9980d001cec8 ("l2tp: add udp encap socket destroy handler").
Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb.
After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API
for fragmentation mem accounting") then here is no need for this
fix-up patch. As percpu_counter is no longer used, it cannot
memory leak it any-longer.
Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2.
There is a bug in fragmentation codes use of the percpu_counter API,
that can cause issues on systems with many CPUs.
The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet. Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).
The correct API usage would be to use __percpu_counter_compare() which
does the right thing, and takes into account the number of (online)
CPUs and batch size, to account for this and call __percpu_counter_sum()
when needed.
We choose to revert the use of the lib/percpu_counter API for frag
memory accounting for several reasons:
1) On systems with CPUs > 24, the heavier fully locked
__percpu_counter_sum() is always invoked, which will be more
expensive than the atomic_t that is reverted to.
Given systems with more than 24 CPUs are becoming common this doesn't
seem like a good option. To mitigate this, the batch size could be
decreased and thresh be increased.
2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
CPU, before SKBs are pushed into sockets on remote CPUs. Given
NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
likely be limited. Thus, a fair chance that atomic add+dec happen
on the same CPU.
Revert note that commit 1d6119baf061 ("net: fix percpu memory leaks")
removed init_frag_mem_limit() and instead use inet_frags_init_net().
After this revert, inet_frags_uninit_net() becomes empty.
Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a listener registers to the FIB notification chain it receives a
dump of the FIB entries and rules from existing address families by
invoking their dump operations.
While we call into these modules we need to make sure they aren't
removed. Do that by increasing their reference count before invoking
their dump operations and decrease it afterwards.
Fixes: 04b1d4e50e82 ("net: core: Make the FIB notification chain generic")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
v2: added the change in drivers/vhost/net.c as spotted
by Willem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In order to convert this atomic_t refcnt to refcount_t,
we need to init the refcount to one to not trigger
a 0 -> 1 transition.
This also removes one atomic operation in fast path.
v2: removed dead code in sock_zerocopy_put_abort()
as suggested by Willem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
not possible to retrieve these from the kernel once they have been
configured on sockets.
Signed-off-by: Ivan Delalande <colona@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Extend inet_diag_handler to allow individual protocols to report
additional data on INET_DIAG_INFO through idiag_get_aux. The size
can be dynamic and is computed by idiag_get_aux_size.
Signed-off-by: Ivan Delalande <colona@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Grepping for "sizeof\(.+\) / sizeof\(" found this as one of the first
candidates.
Maybe a coccinelle can catch all of those.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Three cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Excess of seafood or something happened while I cooked the commit
adding RB tree to inetpeer.
Of course, RCU rules need to be respected or bad things can happen.
In this particular loop, we need to read *pp once per iteration, not
twice.
Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree")
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Pull networking fixes from David Miller:
1) Fix handling of pinned BPF map nodes in hash of maps, from Daniel
Borkmann.
2) IPSEC ESP error paths leak memory, from Steffen Klassert.
3) We need an RCU grace period before freeing fib6_node objects, from
Wei Wang.
4) Must check skb_put_padto() return value in HSR driver, from FLorian
Fainelli.
5) Fix oops on PHY probe failure in ftgmac100 driver, from Andrew
Jeffery.
6) Fix infinite loop in UDP queue when using SO_PEEK_OFF, from Eric
Dumazet.
7) Use after free when tcf_chain_destroy() called multiple times, from
Jiri Pirko.
8) Fix KSZ DSA tag layer multiple free of SKBS, from Florian Fainelli.
9) Fix leak of uninitialized memory in sctp_get_sctp_info(),
inet_diag_msg_sctpladdrs_fill() and inet_diag_msg_sctpaddrs_fill().
From Stefano Brivio.
10) L2TP tunnel refcount fixes from Guillaume Nault.
11) Don't leak UDP secpath in udp_set_dev_scratch(), from Yossi
Kauperman.
12) Revert a PHY layer change wrt. handling of PHY_HALTED state in
phy_stop_machine(), it causes regressions for multiple people. From
Florian Fainelli.
13) When packets are sent out of br0 we have to clear the
offload_fwdq_mark value.
14) Several NULL pointer deref fixes in packet schedulers when their
->init() routine fails. From Nikolay Aleksandrov.
15) Aquantium devices cannot checksum offload correctly when the packet
is <= 60 bytes. From Pavel Belous.
16) Fix vnet header access past end of buffer in AF_PACKET, from
Benjamin Poirier.
17) Double free in probe error paths of nfp driver, from Dan Carpenter.
18) QOS capability not checked properly in DCB init paths of mlx5
driver, from Huy Nguyen.
19) Fix conflicts between firmware load failure and health_care timer in
mlx5, also from Huy Nguyen.
20) Fix dangling page pointer when DMA mapping errors occur in mlx5,
from Eran Ben ELisha.
21) ->ndo_setup_tc() in bnxt_en driver doesn't count rings properly,
from Michael Chan.
22) Missing MSIX vector free in bnxt_en, also from Michael Chan.
23) Refcount leak in xfrm layer when using sk_policy, from Lorenzo
Colitti.
24) Fix copy of uninitialized data in qlge driver, from Arnd Bergmann.
25) bpf_setsockopts() erroneously always returns -EINVAL even on
success. Fix from Yuchung Cheng.
26) tipc_rcv() needs to linearize the SKB before parsing the inner
headers, from Parthasarathy Bhuvaragan.
27) Fix deadlock between link status updates and link removal in netvsc
driver, from Stephen Hemminger.
28) Missed locking of page fragment handling in ESP output, from Steffen
Klassert.
29) Fix refcnt leak in ebpf congestion control code, from Sabrina
Dubroca.
30) sxgbe_probe_config_dt() doesn't check devm_kzalloc()'s return value,
from Christophe Jaillet.
31) Fix missing ipv6 rx_dst_cookie update when rx_dst is updated during
early demux, from Paolo Abeni.
32) Several info leaks in xfrm_user layer, from Mathias Krause.
33) Fix out of bounds read in cxgb4 driver, from Stefano Brivio.
34) Properly propagate obsolete state of route upwards in ipv6 so that
upper holders like xfrm can see it. From Xin Long.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (118 commits)
udp: fix secpath leak
bridge: switchdev: Clear forward mark when transmitting packet
mlxsw: spectrum: Forbid linking to devices that have uppers
wl1251: add a missing spin_lock_init()
Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278
kcm: do not attach PF_KCM sockets to avoid deadlock
sch_tbf: fix two null pointer dereferences on init failure
sch_sfq: fix null pointer dereference on init failure
sch_netem: avoid null pointer deref on init failure
sch_fq_codel: avoid double free on init failure
sch_cbq: fix null pointer dereferences on init failure
sch_hfsc: fix null pointer deref and double free on init failure
sch_hhf: fix null pointer dereference on init failure
sch_multiq: fix double free on init failure
sch_htb: fix crash on init failure
net/mlx5e: Fix CQ moderation mode not set properly
net/mlx5e: Fix inline header size for small packets
net/mlx5: E-Switch, Unload the representors in the correct order
net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address
...
|
|
Make sock_filter_is_valid_access consistent with other is_valid_access
helpers.
Requested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After commit dce4551cb2ad ("udp: preserve head state for IP_CMSG_PASSSEC")
we preserve the secpath for the whole skb lifecycle, but we also
end up leaking a reference to it.
We must clear the head state on skb reception, if secpath is
present.
Fixes: dce4551cb2ad ("udp: preserve head state for IP_CMSG_PASSSEC")
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for
stacked devices") added the 'offload_fwd_mark' bit to the skb in order
to allow drivers to indicate to the bridge driver that they already
forwarded the packet in L2.
In case the bit is set, before transmitting the packet from each port,
the port's mark is compared with the mark stored in the skb's control
block. If both marks are equal, we know the packet arrived from a switch
device that already forwarded the packet and it's not re-transmitted.
However, if the packet is transmitted from the bridge device itself
(e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark
stored in the skb's control block isn't valid.
This scenario can happen in rare cases where a packet was trapped during
L3 forwarding and forwarded by the kernel to a bridge device.
Fixes: 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Yotam Gigi <yotamg@mellanox.com>
Tested-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the
device in case a port is enslaved to a master netdev such as bridge or
bond.
Since the driver ignores events unrelated to its ports and their
uppers, it's possible to engineer situations in which the device's data
path differs from the kernel's.
One example to such a situation is when a port is enslaved to a bond
that is already enslaved to a bridge. When the bond was enslaved the
driver ignored the event - as the bond wasn't one of its uppers - and
therefore a bridge port instance isn't created in the device.
Until such configurations are supported forbid them by checking that the
upper device doesn't have uppers of its own.
Fixes: 0d65fc13042f ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nogah Frankel <nogahf@mellanox.com>
Tested-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-09-01
This should be the last ipsec-next pull request for this
release cycle:
1) Support netdevice ESP trailer removal when decryption
is offloaded. From Yossi Kuperman.
2) Fix overwritten return value of copy_sec_ctx().
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow BPF programs run on sock create to use the get_current_uid_gid
helper. IPv4 and IPv6 sockets are created in a process context so
there is always a valid uid/gid
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add socket mark and priority to fields that can be set by
ebpf program when a socket is created.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This will be used by the IPv6 host table which will be introduced in the
following patches. The fields in the header are added per-use. This header
is global and can be reused by many drivers.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add handling of IPV6_PKTOPTIONS to dccp_v6_do_rcv() in net/dccp/ipv6.c,
similar
to the handling in net/ipv6/tcp_ipv6.c
Signed-off-by: Andrii Vladyka <tulup@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This extends bridge fdb table tracepoints to also cover
learned fdb entries in the br_fdb_update path. Note that
unlike other tracepoints I have moved this to when the fdb
is modified because this is in the datapath and can generate
a lot of noise in the trace output. br_fdb_update is also called
from added_by_user context in the NTF_USE case which is already
traced ..hence the !added_by_user check.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TC filters when used as classifiers are bound to TC classes.
However, there is a hidden difference when adding them in different
orders:
1. If we add tc classes before its filters, everything is fine.
Logically, the classes exist before we specify their ID's in
filters, it is easy to bind them together, just as in the current
code base.
2. If we add tc filters before the tc classes they bind, we have to
do dynamic lookup in fast path. What's worse, this happens all
the time not just once, because on fast path tcf_result is passed
on stack, there is no way to propagate back to the one in tc filters.
This hidden difference hurts performance silently if we have many tc
classes in hierarchy.
This patch intends to close this gap by doing the reverse binding when
we create a new class, in this case we can actually search all the
filters in its parent, match and fixup by classid. And because
tcf_result is specific to each type of tc filter, we have to introduce
a new ops for each filter to tell how to bind the class.
Note, we still can NOT totally get rid of those class lookup in
->enqueue() because cgroup and flow filters have no way to determine
the classid at setup time, they still have to go through dynamic lookup.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A recent commit added an output_mark. When copying
this output_mark, the return value of copy_sec_ctx
is overwitten without a check. Fix this by copying
the output_mark before the security context.
Fixes: 077fbac405bf ("net: xfrm: support setting an output mark.")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
In conjunction with crypto offload [1], removing the ESP trailer by
hardware can potentially improve the performance by avoiding (1) a
cache miss incurred by reading the nexthdr field and (2) the necessity
to calculate the csum value of the trailer in order to keep skb->csum
valid.
This patch introduces the changes to the xfrm stack and merely serves
as an infrastructure. Subsequent patch to mlx5 driver will put this to
a good use.
[1] https://www.mail-archive.com/netdev@vger.kernel.org/msg175733.html
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
IPv4 name uses "destination ip" as does the IPv6 patch set.
Make the mac field consistent.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzkaller had no problem to trigger a deadlock, attaching a KCM socket
to another one (or itself). (original syzkaller report was a very
confusing lockdep splat during a sendmsg())
It seems KCM claims to only support TCP, but no enforcement is done,
so we might need to add additional checks.
Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sch_tbf calls qdisc_watchdog_cancel() in both its ->reset and ->destroy
callbacks but it may fail before the timer is initialized due to missing
options (either not supplied by user-space or set as a default qdisc),
also q->qdisc is used by ->reset and ->destroy so we need it initialized.
Reproduce:
$ sysctl net.core.default_qdisc=tbf
$ ip l set ethX up
Crash log:
[ 959.160172] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 959.160323] IP: qdisc_reset+0xa/0x5c
[ 959.160400] PGD 59cdb067
[ 959.160401] P4D 59cdb067
[ 959.160466] PUD 59ccb067
[ 959.160532] PMD 0
[ 959.160597]
[ 959.160706] Oops: 0000 [#1] SMP
[ 959.160778] Modules linked in: sch_tbf sch_sfb sch_prio sch_netem
[ 959.160891] CPU: 2 PID: 1562 Comm: ip Not tainted 4.13.0-rc6+ #62
[ 959.160998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 959.161157] task: ffff880059c9a700 task.stack: ffff8800376d0000
[ 959.161263] RIP: 0010:qdisc_reset+0xa/0x5c
[ 959.161347] RSP: 0018:ffff8800376d3610 EFLAGS: 00010286
[ 959.161531] RAX: ffffffffa001b1dd RBX: ffff8800373a2800 RCX: 0000000000000000
[ 959.161733] RDX: ffffffff8215f160 RSI: ffffffff8215f160 RDI: 0000000000000000
[ 959.161939] RBP: ffff8800376d3618 R08: 00000000014080c0 R09: 00000000ffffffff
[ 959.162141] R10: ffff8800376d3578 R11: 0000000000000020 R12: ffffffffa001d2c0
[ 959.162343] R13: ffff880037538000 R14: 00000000ffffffff R15: 0000000000000001
[ 959.162546] FS: 00007fcc5126b740(0000) GS:ffff88005d900000(0000) knlGS:0000000000000000
[ 959.162844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 959.163030] CR2: 0000000000000018 CR3: 000000005abc4000 CR4: 00000000000406e0
[ 959.163233] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 959.163436] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 959.163638] Call Trace:
[ 959.163788] tbf_reset+0x19/0x64 [sch_tbf]
[ 959.163957] qdisc_destroy+0x8b/0xe5
[ 959.164119] qdisc_create_dflt+0x86/0x94
[ 959.164284] ? dev_activate+0x129/0x129
[ 959.164449] attach_one_default_qdisc+0x36/0x63
[ 959.164623] netdev_for_each_tx_queue+0x3d/0x48
[ 959.164795] dev_activate+0x4b/0x129
[ 959.164957] __dev_open+0xe7/0x104
[ 959.165118] __dev_change_flags+0xc6/0x15c
[ 959.165287] dev_change_flags+0x25/0x59
[ 959.165451] do_setlink+0x30c/0xb3f
[ 959.165613] ? check_chain_key+0xb0/0xfd
[ 959.165782] rtnl_newlink+0x3a4/0x729
[ 959.165947] ? rtnl_newlink+0x117/0x729
[ 959.166121] ? ns_capable_common+0xd/0xb1
[ 959.166288] ? ns_capable+0x13/0x15
[ 959.166450] rtnetlink_rcv_msg+0x188/0x197
[ 959.166617] ? rcu_read_unlock+0x3e/0x5f
[ 959.166783] ? rtnl_newlink+0x729/0x729
[ 959.166948] netlink_rcv_skb+0x6c/0xce
[ 959.167113] rtnetlink_rcv+0x23/0x2a
[ 959.167273] netlink_unicast+0x103/0x181
[ 959.167439] netlink_sendmsg+0x326/0x337
[ 959.167607] sock_sendmsg_nosec+0x14/0x3f
[ 959.167772] sock_sendmsg+0x29/0x2e
[ 959.167932] ___sys_sendmsg+0x209/0x28b
[ 959.168098] ? do_raw_spin_unlock+0xcd/0xf8
[ 959.168267] ? _raw_spin_unlock+0x27/0x31
[ 959.168432] ? __handle_mm_fault+0x651/0xdb1
[ 959.168602] ? check_chain_key+0xb0/0xfd
[ 959.168773] __sys_sendmsg+0x45/0x63
[ 959.168934] ? __sys_sendmsg+0x45/0x63
[ 959.169100] SyS_sendmsg+0x19/0x1b
[ 959.169260] entry_SYSCALL_64_fastpath+0x23/0xc2
[ 959.169432] RIP: 0033:0x7fcc5097e690
[ 959.169592] RSP: 002b:00007ffd0d5c7b48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 959.169887] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007fcc5097e690
[ 959.170089] RDX: 0000000000000000 RSI: 00007ffd0d5c7b90 RDI: 0000000000000003
[ 959.170292] RBP: ffff8800376d3f98 R08: 0000000000000001 R09: 0000000000000003
[ 959.170494] R10: 00007ffd0d5c7910 R11: 0000000000000246 R12: 0000000000000006
[ 959.170697] R13: 000000000066f1a0 R14: 00007ffd0d5cfc40 R15: 0000000000000000
[ 959.170900] ? trace_hardirqs_off_caller+0xa7/0xcf
[ 959.171076] Code: 00 41 c7 84 24 14 01 00 00 00 00 00 00 41 c7 84 24
98 00 00 00 00 00 00 00 41 5c 41 5d 41 5e 5d c3 66 66 66 66 90 55 48 89
e5 53 <48> 8b 47 18 48 89 fb 48 8b 40 48 48 85 c0 74 02 ff d0 48 8b bb
[ 959.171637] RIP: qdisc_reset+0xa/0x5c RSP: ffff8800376d3610
[ 959.171821] CR2: 0000000000000018
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba43b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently only a memory allocation failure can lead to this, so let's
initialize the timer first.
Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
netem can fail in ->init due to missing options (either not supplied by
user-space or used as a default qdisc) causing a timer->base null
pointer deref in its ->destroy() and ->reset() callbacks.
Reproduce:
$ sysctl net.core.default_qdisc=netem
$ ip l set ethX up
Crash log:
[ 1814.846943] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1814.847181] IP: hrtimer_active+0x17/0x8a
[ 1814.847270] PGD 59c34067
[ 1814.847271] P4D 59c34067
[ 1814.847337] PUD 37374067
[ 1814.847403] PMD 0
[ 1814.847468]
[ 1814.847582] Oops: 0000 [#1] SMP
[ 1814.847655] Modules linked in: sch_netem(O) sch_fq_codel(O)
[ 1814.847761] CPU: 3 PID: 1573 Comm: ip Tainted: G O 4.13.0-rc6+ #62
[ 1814.847884] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 1814.848043] task: ffff88003723a700 task.stack: ffff88005adc8000
[ 1814.848235] RIP: 0010:hrtimer_active+0x17/0x8a
[ 1814.848407] RSP: 0018:ffff88005adcb590 EFLAGS: 00010246
[ 1814.848590] RAX: 0000000000000000 RBX: ffff880058e359d8 RCX: 0000000000000000
[ 1814.848793] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880058e359d8
[ 1814.848998] RBP: ffff88005adcb5b0 R08: 00000000014080c0 R09: 00000000ffffffff
[ 1814.849204] R10: ffff88005adcb660 R11: 0000000000000020 R12: 0000000000000000
[ 1814.849410] R13: ffff880058e359d8 R14: 00000000ffffffff R15: 0000000000000001
[ 1814.849616] FS: 00007f733bbca740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[ 1814.849919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1814.850107] CR2: 0000000000000000 CR3: 0000000059f0d000 CR4: 00000000000406e0
[ 1814.850313] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1814.850518] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1814.850723] Call Trace:
[ 1814.850875] hrtimer_try_to_cancel+0x1a/0x93
[ 1814.851047] hrtimer_cancel+0x15/0x20
[ 1814.851211] qdisc_watchdog_cancel+0x12/0x14
[ 1814.851383] netem_reset+0xe6/0xed [sch_netem]
[ 1814.851561] qdisc_destroy+0x8b/0xe5
[ 1814.851723] qdisc_create_dflt+0x86/0x94
[ 1814.851890] ? dev_activate+0x129/0x129
[ 1814.852057] attach_one_default_qdisc+0x36/0x63
[ 1814.852232] netdev_for_each_tx_queue+0x3d/0x48
[ 1814.852406] dev_activate+0x4b/0x129
[ 1814.852569] __dev_open+0xe7/0x104
[ 1814.852730] __dev_change_flags+0xc6/0x15c
[ 1814.852899] dev_change_flags+0x25/0x59
[ 1814.853064] do_setlink+0x30c/0xb3f
[ 1814.853228] ? check_chain_key+0xb0/0xfd
[ 1814.853396] ? check_chain_key+0xb0/0xfd
[ 1814.853565] rtnl_newlink+0x3a4/0x729
[ 1814.853728] ? rtnl_newlink+0x117/0x729
[ 1814.853905] ? ns_capable_common+0xd/0xb1
[ 1814.854072] ? ns_capable+0x13/0x15
[ 1814.854234] rtnetlink_rcv_msg+0x188/0x197
[ 1814.854404] ? rcu_read_unlock+0x3e/0x5f
[ 1814.854572] ? rtnl_newlink+0x729/0x729
[ 1814.854737] netlink_rcv_skb+0x6c/0xce
[ 1814.854902] rtnetlink_rcv+0x23/0x2a
[ 1814.855064] netlink_unicast+0x103/0x181
[ 1814.855230] netlink_sendmsg+0x326/0x337
[ 1814.855398] sock_sendmsg_nosec+0x14/0x3f
[ 1814.855584] sock_sendmsg+0x29/0x2e
[ 1814.855747] ___sys_sendmsg+0x209/0x28b
[ 1814.855912] ? do_raw_spin_unlock+0xcd/0xf8
[ 1814.856082] ? _raw_spin_unlock+0x27/0x31
[ 1814.856251] ? __handle_mm_fault+0x651/0xdb1
[ 1814.856421] ? check_chain_key+0xb0/0xfd
[ 1814.856592] __sys_sendmsg+0x45/0x63
[ 1814.856755] ? __sys_sendmsg+0x45/0x63
[ 1814.856923] SyS_sendmsg+0x19/0x1b
[ 1814.857083] entry_SYSCALL_64_fastpath+0x23/0xc2
[ 1814.857256] RIP: 0033:0x7f733b2dd690
[ 1814.857419] RSP: 002b:00007ffe1d3387d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1814.858238] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f733b2dd690
[ 1814.858445] RDX: 0000000000000000 RSI: 00007ffe1d338820 RDI: 0000000000000003
[ 1814.858651] RBP: ffff88005adcbf98 R08: 0000000000000001 R09: 0000000000000003
[ 1814.858856] R10: 00007ffe1d3385a0 R11: 0000000000000246 R12: 0000000000000002
[ 1814.859060] R13: 000000000066f1a0 R14: 00007ffe1d3408d0 R15: 0000000000000000
[ 1814.859267] ? trace_hardirqs_off_caller+0xa7/0xcf
[ 1814.859446] Code: 10 55 48 89 c7 48 89 e5 e8 45 a1 fb ff 31 c0 5d c3
31 c0 c3 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 49 89 fd 49 8b
45 30 <4c> 8b 20 41 8b 5c 24 38 31 c9 31 d2 48 c7 c7 50 8e 1d 82 41 89
[ 1814.860022] RIP: hrtimer_active+0x17/0x8a RSP: ffff88005adcb590
[ 1814.860214] CR2: 0000000000000000
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba43b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is very unlikely to happen but the backlogs memory allocation
could fail and will free q->flows, but then ->destroy() will free
q->flows too. For correctness remove the first free and let ->destroy
clean up.
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
CBQ can fail on ->init by wrong nl attributes or simply for missing any,
f.e. if it's set as a default qdisc then TCA_OPTIONS (opt) will be NULL
when it is activated. The first thing init does is parse opt but it will
dereference a null pointer if used as a default qdisc, also since init
failure at default qdisc invokes ->reset() which cancels all timers then
we'll also dereference two more null pointers (timer->base) as they were
never initialized.
To reproduce:
$ sysctl net.core.default_qdisc=cbq
$ ip l set ethX up
Crash log of the first null ptr deref:
[44727.907454] BUG: unable to handle kernel NULL pointer dereference at (null)
[44727.907600] IP: cbq_init+0x27/0x205
[44727.907676] PGD 59ff4067
[44727.907677] P4D 59ff4067
[44727.907742] PUD 59c70067
[44727.907807] PMD 0
[44727.907873]
[44727.907982] Oops: 0000 [#1] SMP
[44727.908054] Modules linked in:
[44727.908126] CPU: 1 PID: 21312 Comm: ip Not tainted 4.13.0-rc6+ #60
[44727.908235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[44727.908477] task: ffff88005ad42700 task.stack: ffff880037214000
[44727.908672] RIP: 0010:cbq_init+0x27/0x205
[44727.908838] RSP: 0018:ffff8800372175f0 EFLAGS: 00010286
[44727.909018] RAX: ffffffff816c3852 RBX: ffff880058c53800 RCX: 0000000000000000
[44727.909222] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff8800372175f8
[44727.909427] RBP: ffff880037217650 R08: ffffffff81b0f380 R09: 0000000000000000
[44727.909631] R10: ffff880037217660 R11: 0000000000000020 R12: ffffffff822a44c0
[44727.909835] R13: ffff880058b92000 R14: 00000000ffffffff R15: 0000000000000001
[44727.910040] FS: 00007ff8bc583740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
[44727.910339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[44727.910525] CR2: 0000000000000000 CR3: 00000000371e5000 CR4: 00000000000406e0
[44727.910731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[44727.910936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[44727.911141] Call Trace:
[44727.911291] ? lockdep_init_map+0xb6/0x1ba
[44727.911461] ? qdisc_alloc+0x14e/0x187
[44727.911626] qdisc_create_dflt+0x7a/0x94
[44727.911794] ? dev_activate+0x129/0x129
[44727.911959] attach_one_default_qdisc+0x36/0x63
[44727.912132] netdev_for_each_tx_queue+0x3d/0x48
[44727.912305] dev_activate+0x4b/0x129
[44727.912468] __dev_open+0xe7/0x104
[44727.912631] __dev_change_flags+0xc6/0x15c
[44727.912799] dev_change_flags+0x25/0x59
[44727.912966] do_setlink+0x30c/0xb3f
[44727.913129] ? check_chain_key+0xb0/0xfd
[44727.913294] ? check_chain_key+0xb0/0xfd
[44727.913463] rtnl_newlink+0x3a4/0x729
[44727.913626] ? rtnl_newlink+0x117/0x729
[44727.913801] ? ns_capable_common+0xd/0xb1
[44727.913968] ? ns_capable+0x13/0x15
[44727.914131] rtnetlink_rcv_msg+0x188/0x197
[44727.914300] ? rcu_read_unlock+0x3e/0x5f
[44727.914465] ? rtnl_newlink+0x729/0x729
[44727.914630] netlink_rcv_skb+0x6c/0xce
[44727.914796] rtnetlink_rcv+0x23/0x2a
[44727.914956] netlink_unicast+0x103/0x181
[44727.915122] netlink_sendmsg+0x326/0x337
[44727.915291] sock_sendmsg_nosec+0x14/0x3f
[44727.915459] sock_sendmsg+0x29/0x2e
[44727.915619] ___sys_sendmsg+0x209/0x28b
[44727.915784] ? do_raw_spin_unlock+0xcd/0xf8
[44727.915954] ? _raw_spin_unlock+0x27/0x31
[44727.916121] ? __handle_mm_fault+0x651/0xdb1
[44727.916290] ? check_chain_key+0xb0/0xfd
[44727.916461] __sys_sendmsg+0x45/0x63
[44727.916626] ? __sys_sendmsg+0x45/0x63
[44727.916792] SyS_sendmsg+0x19/0x1b
[44727.916950] entry_SYSCALL_64_fastpath+0x23/0xc2
[44727.917125] RIP: 0033:0x7ff8bbc96690
[44727.917286] RSP: 002b:00007ffc360991e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[44727.917579] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007ff8bbc96690
[44727.917783] RDX: 0000000000000000 RSI: 00007ffc36099230 RDI: 0000000000000003
[44727.917987] RBP: ffff880037217f98 R08: 0000000000000001 R09: 0000000000000003
[44727.918190] R10: 00007ffc36098fb0 R11: 0000000000000246 R12: 0000000000000006
[44727.918393] R13: 000000000066f1a0 R14: 00007ffc360a12e0 R15: 0000000000000000
[44727.918597] ? trace_hardirqs_off_caller+0xa7/0xcf
[44727.918774] Code: 41 5f 5d c3 66 66 66 66 90 55 48 8d 56 04 45 31 c9
49 c7 c0 80 f3 b0 81 48 89 e5 41 55 41 54 53 48 89 fb 48 8d 7d a8 48 83
ec 48 <0f> b7 0e be 07 00 00 00 83 e9 04 e8 e6 f7 d8 ff 85 c0 0f 88 bb
[44727.919332] RIP: cbq_init+0x27/0x205 RSP: ffff8800372175f0
[44727.919516] CR2: 0000000000000000
Fixes: 0fbbeb1ba43b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Depending on where ->init fails we can get a null pointer deref due to
uninitialized hires timer (watchdog) or a double free of the qdisc hash
because it is already freed by ->destroy().
Fixes: 8d5537387505 ("net/sched/hfsc: allocate tcf block for hfsc root class")
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If sch_hhf fails in its ->init() function (either due to wrong
user-space arguments as below or memory alloc failure of hh_flows) it
will do a null pointer deref of q->hh_flows in its ->destroy() function.
To reproduce the crash:
$ tc qdisc add dev eth0 root hhf quantum 2000000 non_hh_weight 10000000
Crash log:
[ 690.654882] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 690.655565] IP: hhf_destroy+0x48/0xbc
[ 690.655944] PGD 37345067
[ 690.655948] P4D 37345067
[ 690.656252] PUD 58402067
[ 690.656554] PMD 0
[ 690.656857]
[ 690.657362] Oops: 0000 [#1] SMP
[ 690.657696] Modules linked in:
[ 690.658032] CPU: 3 PID: 920 Comm: tc Not tainted 4.13.0-rc6+ #57
[ 690.658525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 690.659255] task: ffff880058578000 task.stack: ffff88005acbc000
[ 690.659747] RIP: 0010:hhf_destroy+0x48/0xbc
[ 690.660146] RSP: 0018:ffff88005acbf9e0 EFLAGS: 00010246
[ 690.660601] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000000
[ 690.661155] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff821f63f0
[ 690.661710] RBP: ffff88005acbfa08 R08: ffffffff81b10a90 R09: 0000000000000000
[ 690.662267] R10: 00000000f42b7019 R11: ffff880058578000 R12: 00000000ffffffea
[ 690.662820] R13: ffff8800372f6400 R14: 0000000000000000 R15: 0000000000000000
[ 690.663769] FS: 00007f8ae5e8b740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[ 690.667069] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 690.667965] CR2: 0000000000000000 CR3: 0000000058523000 CR4: 00000000000406e0
[ 690.668918] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 690.669945] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 690.671003] Call Trace:
[ 690.671743] qdisc_create+0x377/0x3fd
[ 690.672534] tc_modify_qdisc+0x4d2/0x4fd
[ 690.673324] rtnetlink_rcv_msg+0x188/0x197
[ 690.674204] ? rcu_read_unlock+0x3e/0x5f
[ 690.675091] ? rtnl_newlink+0x729/0x729
[ 690.675877] netlink_rcv_skb+0x6c/0xce
[ 690.676648] rtnetlink_rcv+0x23/0x2a
[ 690.677405] netlink_unicast+0x103/0x181
[ 690.678179] netlink_sendmsg+0x326/0x337
[ 690.678958] sock_sendmsg_nosec+0x14/0x3f
[ 690.679743] sock_sendmsg+0x29/0x2e
[ 690.680506] ___sys_sendmsg+0x209/0x28b
[ 690.681283] ? __handle_mm_fault+0xc7d/0xdb1
[ 690.681915] ? check_chain_key+0xb0/0xfd
[ 690.682449] __sys_sendmsg+0x45/0x63
[ 690.682954] ? __sys_sendmsg+0x45/0x63
[ 690.683471] SyS_sendmsg+0x19/0x1b
[ 690.683974] entry_SYSCALL_64_fastpath+0x23/0xc2
[ 690.684516] RIP: 0033:0x7f8ae529d690
[ 690.685016] RSP: 002b:00007fff26d2d6b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 690.685931] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f8ae529d690
[ 690.686573] RDX: 0000000000000000 RSI: 00007fff26d2d700 RDI: 0000000000000003
[ 690.687047] RBP: ffff88005acbff98 R08: 0000000000000001 R09: 0000000000000000
[ 690.687519] R10: 00007fff26d2d480 R11: 0000000000000246 R12: 0000000000000002
[ 690.687996] R13: 0000000001258070 R14: 0000000000000001 R15: 0000000000000000
[ 690.688475] ? trace_hardirqs_off_caller+0xa7/0xcf
[ 690.688887] Code: 00 00 e8 2a 02 ae ff 49 8b bc 1d 60 02 00 00 48 83
c3 08 e8 19 02 ae ff 48 83 fb 20 75 dc 45 31 f6 4d 89 f7 4d 03 bd 20 02
00 00 <49> 8b 07 49 39 c7 75 24 49 83 c6 10 49 81 fe 00 40 00 00 75 e1
[ 690.690200] RIP: hhf_destroy+0x48/0xbc RSP: ffff88005acbf9e0
[ 690.690636] CR2: 0000000000000000
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The below commit added a call to ->destroy() on init failure, but multiq
still frees ->queues on error in init, but ->queues is also freed by
->destroy() thus we get double free and corrupted memory.
Very easy to reproduce (eth0 not multiqueue):
$ tc qdisc add dev eth0 root multiq
RTNETLINK answers: Operation not supported
$ ip l add dumdum type dummy
(crash)
Trace log:
[ 3929.467747] general protection fault: 0000 [#1] SMP
[ 3929.468083] Modules linked in:
[ 3929.468302] CPU: 3 PID: 967 Comm: ip Not tainted 4.13.0-rc6+ #56
[ 3929.468625] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 3929.469124] task: ffff88003716a700 task.stack: ffff88005872c000
[ 3929.469449] RIP: 0010:__kmalloc_track_caller+0x117/0x1be
[ 3929.469746] RSP: 0018:ffff88005872f6a0 EFLAGS: 00010246
[ 3929.470042] RAX: 00000000000002de RBX: 0000000058a59000 RCX: 00000000000002df
[ 3929.470406] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff821f7020
[ 3929.470770] RBP: ffff88005872f6e8 R08: 000000000001f010 R09: 0000000000000000
[ 3929.471133] R10: ffff88005872f730 R11: 0000000000008cdd R12: ff006d75646d7564
[ 3929.471496] R13: 00000000014000c0 R14: ffff88005b403c00 R15: ffff88005b403c00
[ 3929.471869] FS: 00007f0b70480740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[ 3929.472286] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3929.472677] CR2: 00007ffcee4f3000 CR3: 0000000059d45000 CR4: 00000000000406e0
[ 3929.473209] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3929.474109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3929.474873] Call Trace:
[ 3929.475337] ? kstrdup_const+0x23/0x25
[ 3929.475863] kstrdup+0x2e/0x4b
[ 3929.476338] kstrdup_const+0x23/0x25
[ 3929.478084] __kernfs_new_node+0x28/0xbc
[ 3929.478478] kernfs_new_node+0x35/0x55
[ 3929.478929] kernfs_create_link+0x23/0x76
[ 3929.479478] sysfs_do_create_link_sd.isra.2+0x85/0xd7
[ 3929.480096] sysfs_create_link+0x33/0x35
[ 3929.480649] device_add+0x200/0x589
[ 3929.481184] netdev_register_kobject+0x7c/0x12f
[ 3929.481711] register_netdevice+0x373/0x471
[ 3929.482174] rtnl_newlink+0x614/0x729
[ 3929.482610] ? rtnl_newlink+0x17f/0x729
[ 3929.483080] rtnetlink_rcv_msg+0x188/0x197
[ 3929.483533] ? rcu_read_unlock+0x3e/0x5f
[ 3929.483984] ? rtnl_newlink+0x729/0x729
[ 3929.484420] netlink_rcv_skb+0x6c/0xce
[ 3929.484858] rtnetlink_rcv+0x23/0x2a
[ 3929.485291] netlink_unicast+0x103/0x181
[ 3929.485735] netlink_sendmsg+0x326/0x337
[ 3929.486181] sock_sendmsg_nosec+0x14/0x3f
[ 3929.486614] sock_sendmsg+0x29/0x2e
[ 3929.486973] ___sys_sendmsg+0x209/0x28b
[ 3929.487340] ? do_raw_spin_unlock+0xcd/0xf8
[ 3929.487719] ? _raw_spin_unlock+0x27/0x31
[ 3929.488092] ? __handle_mm_fault+0x651/0xdb1
[ 3929.488471] ? check_chain_key+0xb0/0xfd
[ 3929.488847] __sys_sendmsg+0x45/0x63
[ 3929.489206] ? __sys_sendmsg+0x45/0x63
[ 3929.489576] SyS_sendmsg+0x19/0x1b
[ 3929.489901] entry_SYSCALL_64_fastpath+0x23/0xc2
[ 3929.490172] RIP: 0033:0x7f0b6fb93690
[ 3929.490423] RSP: 002b:00007ffcee4ed588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 3929.490881] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f0b6fb93690
[ 3929.491198] RDX: 0000000000000000 RSI: 00007ffcee4ed5d0 RDI: 0000000000000003
[ 3929.491521] RBP: ffff88005872ff98 R08: 0000000000000001 R09: 0000000000000000
[ 3929.491801] R10: 00007ffcee4ed350 R11: 0000000000000246 R12: 0000000000000002
[ 3929.492075] R13: 000000000066f1a0 R14: 00007ffcee4f5680 R15: 0000000000000000
[ 3929.492352] ? trace_hardirqs_off_caller+0xa7/0xcf
[ 3929.492590] Code: 8b 45 c0 48 8b 45 b8 74 17 48 8b 4d c8 83 ca ff 44
89 ee 4c 89 f7 e8 83 ca ff ff 49 89 c4 eb 49 49 63 56 20 48 8d 48 01 4d
8b 06 <49> 8b 1c 14 48 89 c2 4c 89 e0 65 49 0f c7 08 0f 94 c0 83 f0 01
[ 3929.493335] RIP: __kmalloc_track_caller+0x117/0x1be RSP: ffff88005872f6a0
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Fixes: f07d1501292b ("multiq: Further multiqueue cleanup")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The commit below added a call to the ->destroy() callback for all qdiscs
which failed in their ->init(), but some were not prepared for such
change and can't handle partially initialized qdisc. HTB is one of them
and if any error occurs before the qdisc watchdog timer and qdisc work are
initialized then we can hit either a null ptr deref (timer->base) when
canceling in ->destroy or lockdep error info about trying to register
a non-static key and a stack dump. So to fix these two move the watchdog
timer and workqueue init before anything that can err out.
To reproduce userspace needs to send broken htb qdisc create request,
tested with a modified tc (q_htb.c).
Trace log:
[ 2710.897602] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2710.897977] IP: hrtimer_active+0x17/0x8a
[ 2710.898174] PGD 58fab067
[ 2710.898175] P4D 58fab067
[ 2710.898353] PUD 586c0067
[ 2710.898531] PMD 0
[ 2710.898710]
[ 2710.899045] Oops: 0000 [#1] SMP
[ 2710.899232] Modules linked in:
[ 2710.899419] CPU: 1 PID: 950 Comm: tc Not tainted 4.13.0-rc6+ #54
[ 2710.899646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 2710.900035] task: ffff880059ed2700 task.stack: ffff88005ad4c000
[ 2710.900262] RIP: 0010:hrtimer_active+0x17/0x8a
[ 2710.900467] RSP: 0018:ffff88005ad4f960 EFLAGS: 00010246
[ 2710.900684] RAX: 0000000000000000 RBX: ffff88003701e298 RCX: 0000000000000000
[ 2710.900933] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003701e298
[ 2710.901177] RBP: ffff88005ad4f980 R08: 0000000000000001 R09: 0000000000000001
[ 2710.901419] R10: ffff88005ad4f800 R11: 0000000000000400 R12: 0000000000000000
[ 2710.901663] R13: ffff88003701e298 R14: ffffffff822a4540 R15: ffff88005ad4fac0
[ 2710.901907] FS: 00007f2f5e90f740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
[ 2710.902277] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2710.902500] CR2: 0000000000000000 CR3: 0000000058ca3000 CR4: 00000000000406e0
[ 2710.902744] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2710.902977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2710.903180] Call Trace:
[ 2710.903332] hrtimer_try_to_cancel+0x1a/0x93
[ 2710.903504] hrtimer_cancel+0x15/0x20
[ 2710.903667] qdisc_watchdog_cancel+0x12/0x14
[ 2710.903866] htb_destroy+0x2e/0xf7
[ 2710.904097] qdisc_create+0x377/0x3fd
[ 2710.904330] tc_modify_qdisc+0x4d2/0x4fd
[ 2710.904511] rtnetlink_rcv_msg+0x188/0x197
[ 2710.904682] ? rcu_read_unlock+0x3e/0x5f
[ 2710.904849] ? rtnl_newlink+0x729/0x729
[ 2710.905017] netlink_rcv_skb+0x6c/0xce
[ 2710.905183] rtnetlink_rcv+0x23/0x2a
[ 2710.905345] netlink_unicast+0x103/0x181
[ 2710.905511] netlink_sendmsg+0x326/0x337
[ 2710.905679] sock_sendmsg_nosec+0x14/0x3f
[ 2710.905847] sock_sendmsg+0x29/0x2e
[ 2710.906010] ___sys_sendmsg+0x209/0x28b
[ 2710.906176] ? do_raw_spin_unlock+0xcd/0xf8
[ 2710.906346] ? _raw_spin_unlock+0x27/0x31
[ 2710.906514] ? __handle_mm_fault+0x651/0xdb1
[ 2710.906685] ? check_chain_key+0xb0/0xfd
[ 2710.906855] __sys_sendmsg+0x45/0x63
[ 2710.907018] ? __sys_sendmsg+0x45/0x63
[ 2710.907185] SyS_sendmsg+0x19/0x1b
[ 2710.907344] entry_SYSCALL_64_fastpath+0x23/0xc2
Note that probably this bug goes further back because the default qdisc
handling always calls ->destroy on init failure too.
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba43b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
IPv6 packet may carry more than one extension header, and IPv6 nodes must
accept and attempt to process extension headers in any order and occurring
any number of times in the same packet. Hence, there should be no
assumption that Segment Routing extension header is to appear immediately
after the IPv6 header.
Moreover, section 4.1 of RFC 8200 gives a recommendation on the order of
appearance of those extension headers within an IPv6 packet. According to
this recommendation, Segment Routing extension header should appear after
Hop-by-Hop and Destination Options headers (if they present).
This patch fixes the get_srh(), so it gets the segment routing header
regardless of its position in the chain of the extension headers in IPv6
packet, and makes sure that the IPv6 routing extension header is of Type 4.
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Typically, each TC filter has its own action. All the actions of the
same type are saved in its hash table. But the hash buckets are too
small that it degrades to a list. And the performance is greatly
affected. For example, it takes about 0m11.914s to insert 64K rules.
If we convert the hash table to IDR, it only takes about 0m1.500s.
The improvement is huge.
But please note that the test result is based on previous patch that
cls_flower uses IDR.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, all filters with the same priority are linked in a doubly
linked list. Every filter should have a unique handle. To make the
handle unique, we need to iterate the list every time to see if the
handle exists or not when inserting a new filter. It is time-consuming.
For example, it takes about 5m3.169s to insert 64K rules.
This patch changes cls_flower to use IDR. With this patch, it
takes about 0m1.127s to insert 64K rules. The improvement is huge.
But please note that in this testing, all filters share the same action.
If every filter has a unique action, that is another bottleneck.
Follow-up patch in this patchset addresses that.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The legacy ioctl interfaces are only useful for BR/EDR operation and
since Linux 3.4 no longer needed anyway. This options allows disabling
them alltogether and use only management interfaces for setup and
control.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
|
|
This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.
Eric Dumazet says:
We found at Google a significant regression caused by
45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction
In typical RPC (TCP_RR), when a TCP socket receives data, we now call
tcp_ack() while we used to not call it.
This touches enough cache lines to cause a slowdown.
so problem does not seem to be HP removal itself but the tcp_ack()
call. Therefore, it might be possible to remove HP after all, provided
one finds a way to elide tcp_ack for most cases.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This change was a followup to the header prediction removal,
so first revert this as a prerequisite to back out hp removal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|