Age | Commit message (Collapse) | Author |
|
Make sure that sendmsg() gets woken up if the call it is waiting for
completes abnormally.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Don't send an IDLE ACK at the end of the transmission of the response to a
service call. The service end resends DATA packets until the client sends an
ACK that hard-acks all the send data. At that point, the call is complete.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.
If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.
We take into account the difference between the time that was programmed
when last packet was sent, and current time (a drift of tens of usecs is
often observed)
Add an EWMA of the unthrottle latency to help diagnostics.
This latency is the difference between current time and oldest packet in
delayed RB-tree. This accounts for the high resolution timer latency,
but can be different under stress, as fq_check_throttled() can be
opportunistically be called from a dequeue() called after an enqueue()
for a different flow.
Tested:
// Start a 10Gbit flow
$ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &
Before patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52
After patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52
A new iproute2 tc can output the 'unthrottle latency' :
$ tc -s qd sh dev eth0 | grep latency
0 gc, 0 highprio, 32490767 throttled, 2382 ns latency
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sctp_acked() is using 32bit arithmetics on 16bits vars, via TSN_lte()
macros, which is weird and confusing.
Once the offset to ctsn is calculated, all wrapping is already handled
and thus to verify the Gap Ack blocks we can just use pure
less/big-or-equal than checks.
Also, rename gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.
Even so, I don't think this discrepancy resulted in any practical bug.
This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.
Suggested-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
On error path in route4_change(), 'f' could be NULL,
so we should check NULL before calling tcf_exts_destroy().
Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
We already checked for !found just a bit before:
if (!found) {
regs->verdict.code = NFT_BREAK;
return;
}
if (found && set->flags & NFT_SET_MAP)
^^^^^
So this redundant check can just go away.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
It's better to use sizeof(info->name)-1 as index to force set the string
tail instead of literal number '29'.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
There are some codes which are used to get one random once in netfilter.
We could use net_get_random_once to simplify these codes.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
pkt->xt.thoff is not always set properly, but we use it without any check.
For payload expr, it will cause wrong results. For nftrace, we may notify
the wrong network or transport header to the user space, furthermore,
input the following nft rules, warning message will be printed out:
# nft add rule arp filter output meta nftrace set 1
WARNING: CPU: 0 PID: 13428 at net/netfilter/nf_tables_trace.c:263
nft_trace_notify+0x4a3/0x5e0 [nf_tables]
Call Trace:
[<ffffffff813d58ae>] dump_stack+0x63/0x85
[<ffffffff810a4c0b>] __warn+0xcb/0xf0
[<ffffffff810a4d3d>] warn_slowpath_null+0x1d/0x20
[<ffffffffa0589703>] nft_trace_notify+0x4a3/0x5e0 [nf_tables]
[ ... ]
[<ffffffffa05690a8>] nft_do_chain_arp+0x78/0x90 [nf_tables_arp]
[<ffffffff816f4aa2>] nf_iterate+0x62/0x80
[<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0
[<ffffffff81732bbf>] arp_xmit+0x8f/0xb0
[ ... ]
[<ffffffff81732d36>] arp_solicit+0x106/0x2c0
So before we use pkt->xt.thoff, check the tprot_set first.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
There's an off-by-one issue in nft_payload_fast_eval, skb_tail_pointer
and ptr + priv->len all point to the last valid address plus 1. So if
they are equal, we can still fetch the valid data. It's unnecessary to
fall back to nft_payload_eval.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Currently, the user can specify the queue numbers by _QUEUE_NUM and
_QUEUE_TOTAL attributes, this is enough in most situations.
But acctually, it is not very flexible, for example:
tcp dport 80 mapped to queue0
tcp dport 81 mapped to queue1
tcp dport 82 mapped to queue2
In order to do this thing, we must add 3 nft rules, and more
mapping meant more rules ...
So take one register to select the queue number, then we can add one
simple rule to mapping queues, maybe like this:
queue num tcp dport map { 80:0, 81:1, 82:2 ... }
Florian Westphal also proposed wider usage scenarios:
queue num jhash ip saddr . ip daddr mod ...
queue num meta cpu ...
queue num meta mark ...
The last point is how to load a queue number from sreg, although we can
use *(u16*)®s->data[reg] to load the queue number, just like nat expr
to load its l4port do.
But we will cooperate with hash expr, meta cpu, meta mark expr and so on.
They all store the result to u32 type, so cast it to u16 pointer and
dereference it will generate wrong result in the big endian system.
So just keep it simple, we treat queue number as u32 type, although u16
type is already enough.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Fetch value and validate u32 netlink attribute. This validation is
usually required when the u32 netlink attributes are being stored in a
field whose size is smaller.
This patch revisits 4da449ae1df9 ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
Fixes: 96518518cc41 ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Pull networking fixes from David Miller:
"Mostly small bits scattered all over the place, which is usually how
things go this late in the -rc series.
1) Proper driver init device resets in bnx2, from Baoquan He.
2) Fix accounting overflow in __tcp_retransmit_skb(),
sk_forward_alloc, and ip_idents_reserve, from Eric Dumazet.
3) Fix crash in bna driver ethtool stats handling, from Ivan Vecera.
4) Missing check of skb_linearize() return value in mac80211, from
Johannes Berg.
5) Endianness fix in nf_table_trace dumps, from Liping Zhang.
6) SSN comparison fix in SCTP, from Marcelo Ricardo Leitner.
7) Update DSA and b44 MAINTAINERS entries.
8) Make input path of vti6 driver work again, from Nicolas Dichtel.
9) Off-by-one in mlx4, from Sebastian Ott.
10) Fix fallback route lookup handling in ipv6, from Vincent Bernat.
11) Fix stack corruption on probe in qed driver, from Yuval Mintz.
12) PHY init fixes in r8152 from Hayes Wang.
13) Missing SKB free in irda_accept error path, from Phil Turnbull"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tcp: properly account Fast Open SYN-ACK retrans
tcp: fix under-accounting retransmit SNMP counters
MAINTAINERS: Update b44 maintainer.
net: get rid of an signed integer overflow in ip_idents_reserve()
net/mlx4_core: Fix to clean devlink resources
net: can: ifi: Configure transmitter delay
vti6: fix input path
ipmr, ip6mr: return lastuse relative to now
r8152: disable ALDPS and EEE before setting PHY
r8152: remove r8153_enable_eee
r8152: move PHY settings to hw_phy_cfg
r8152: move enabling PHY
r8152: move some functions
cxgb4/cxgb4vf: Allocate more queues for 25G and 100G adapter
qed: Fix stack corruption on probe
MAINTAINERS: Add an entry for the core network DSA code
net: ipv6: fallback to full lookup if table lookup is unsuitable
net/mlx5: E-Switch, Handle mode change failures
net/mlx5: E-Switch, Fix error flow in the SRIOV e-switch init code
net/mlx5: Fix flow counter bulk command out mailbox allocation
...
|
|
Scan response data should not be updated unless there
is an advertising instance.
Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Adds missing callback assignment to cmd_complete in pending management command
context. Dump path involves security procedure performed on legacy (pre-SSP)
devices with service security requirements set to HIGH (16digits PIN).
It fails when shorter PIN is delivered by user.
[ 1.517950] Bluetooth: PIN code is not 16 bytes long
[ 1.518491] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1.518584] IP: [< (null)>] (null)
[ 1.518584] PGD 9e08067 PUD 9fdf067 PMD 0
[ 1.518584] Oops: 0010 [#1] SMP
[ 1.518584] Modules linked in:
[ 1.518584] CPU: 0 PID: 1002 Comm: kworker/u3:2 Not tainted 4.8.0-rc6-354649-gaf4168c #16
[ 1.518584] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-20160701_074356-anatol 04/01/2014
[ 1.518584] Workqueue: hci0 hci_rx_work
[ 1.518584] task: ffff880009ce14c0 task.stack: ffff880009e10000
[ 1.518584] RIP: 0010:[<0000000000000000>] [< (null)>] (null)
[ 1.518584] RSP: 0018:ffff880009e13bc8 EFLAGS: 00010293
[ 1.518584] RAX: 0000000000000000 RBX: ffff880009eed100 RCX: 0000000000000006
[ 1.518584] RDX: ffff880009ddc000 RSI: 0000000000000000 RDI: ffff880009eed100
[ 1.518584] RBP: ffff880009e13be0 R08: 0000000000000000 R09: 0000000000000001
[ 1.518584] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1.518584] R13: ffff880009e13ccd R14: ffff880009ddc000 R15: ffff880009ddc010
[ 1.518584] FS: 0000000000000000(0000) GS:ffff88000bc00000(0000) knlGS:0000000000000000
[ 1.518584] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.518584] CR2: 0000000000000000 CR3: 0000000009fdd000 CR4: 00000000000006f0
[ 1.518584] Stack:
[ 1.518584] ffffffff81909808 ffff880009e13cce ffff880009e0d40b ffff880009e13c68
[ 1.518584] ffffffff818f428d 00000000024000c0 ffff880009e13c08 ffffffff810ca903
[ 1.518584] ffff880009e13c48 ffffffff811ade34 ffffffff8178c31f ffff880009ee6200
[ 1.518584] Call Trace:
[ 1.518584] [<ffffffff81909808>] ? mgmt_pin_code_neg_reply_complete+0x38/0x60
[ 1.518584] [<ffffffff818f428d>] hci_cmd_complete_evt+0x69d/0x3200
[ 1.518584] [<ffffffff810ca903>] ? rcu_read_lock_sched_held+0x53/0x60
[ 1.518584] [<ffffffff811ade34>] ? kmem_cache_alloc+0x1a4/0x200
[ 1.518584] [<ffffffff8178c31f>] ? skb_clone+0x4f/0xa0
[ 1.518584] [<ffffffff818f9d81>] hci_event_packet+0x8e1/0x28e0
[ 1.518584] [<ffffffff81a421f1>] ? _raw_spin_unlock_irqrestore+0x31/0x50
[ 1.518584] [<ffffffff810aea3e>] ? trace_hardirqs_on_caller+0xee/0x1b0
[ 1.518584] [<ffffffff818e6bd1>] hci_rx_work+0x1e1/0x5b0
[ 1.518584] [<ffffffff8107e4bd>] ? process_one_work+0x1ed/0x6b0
[ 1.518584] [<ffffffff8107e538>] process_one_work+0x268/0x6b0
[ 1.518584] [<ffffffff8107e4bd>] ? process_one_work+0x1ed/0x6b0
[ 1.518584] [<ffffffff8107e9c3>] worker_thread+0x43/0x4e0
[ 1.518584] [<ffffffff8107e980>] ? process_one_work+0x6b0/0x6b0
[ 1.518584] [<ffffffff8107e980>] ? process_one_work+0x6b0/0x6b0
[ 1.518584] [<ffffffff8108505f>] kthread+0xdf/0x100
[ 1.518584] [<ffffffff81a4297f>] ret_from_fork+0x1f/0x40
[ 1.518584] [<ffffffff81084f80>] ? kthread_create_on_node+0x210/0x210
Signed-off-by: Arek Lichwa <arek.lichwa@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Add support of an offset value for incremental counter and random. With
this option the sysadmin is able to start the counter to a certain value
and then apply the generated number.
Example:
meta mark set numgen inc mod 2 offset 100
This will generate marks with the serie 100, 101, 100, 101, ...
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Preparation for slow-start algorithm [ver #2]
Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:
(1) Stop storing the protocol header in the Tx socket buffers, but rather
generate it on the fly. This potentially saves a little space and
makes it easier to alter the header just before transmission (the
flags may get altered and the serial number has to be changed).
(2) Mask off the Tx buffer annotations and add a flag to record which ones
have already been resent.
(3) Track RTT on a per-peer basis for use in future changes. Tracepoints
are added to log this.
(4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
ACK from which RTT data can be calculated. The response also carries
other useful information.
(5) Expedite PING-RESPONSE ACK generation from sendmsg. If we're actively
using sendmsg, this allows us, under some circumstances, to avoid
having to rely on the background work item to run to generate this
ACK.
This requires ktime_sub_ms() to be added.
(6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
ACKs from which RTT data can be calculated.
(7) Limit the use of pings and ACK requests for RTT determination.
Changes:
(V2) Don't use the C division operator for 64-bit division. One instance
should use do_div() and the other should be using nsecs_to_jiffies().
The last two patches got transposed, leading to an undefined symbol
in one of them.
Reported-by: kbuild test robot <lkp@intel.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic. Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.
This could probably be made adjustable in future.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Reduce the number of ACK-Requests we set on DATA packets that we're sending
to reduce network traffic. We set the flag on odd-numbered DATA packets to
start off the RTT cache until we have at least three entries in it and then
probe once per second thereafter to keep it topped up.
This could be made tunable in future.
Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
packets as we transmit them and not stored statically in the sk_buff.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
upon handshake completes. This patch fixes it.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch fixes these under-accounting SNMP rtx stats
LINUX_MIB_TCPFORWARDRETRANS
LINUX_MIB_TCPFASTRETRANS
LINUX_MIB_TCPSLOWSTARTRETRANS
when retransmitting TSO packets
Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.
This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.
This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Expedite the transmission of a response to a PING ACK by sending it from
sendmsg if one is pending. We're most likely to see a PING ACK during the
client call Tx phase as the other side may use it to determine a number of
parameters, such as the client's receive window size, the RTT and whether
the client is doing slow start (similar to RFC5681).
If we don't expedite it, it's left to the background processing thread to
transmit.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for. The PING RESPONSE ACK packet will tell us
the following about the peer:
(1) its receive window size
(2) its MTU sizes
(3) its support for jumbo DATA packets
(4) if it supports slow start (similar to RFC 5681)
(5) an estimate of the RTT
This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.
A pair of tracepoints are added so that RTT determination can be observed.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
And avoid the usage of '&~3'. This is the last place still not using
the macro.
Also break the line to make it easier to read.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To something more meaningful these days, specially because this is
working on packet headers or lengths and which are not tied to any CPU
arch but to the protocol itself.
So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4.
Reported-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2016-09-21
1) Propagate errors on security context allocation.
From Mathias Krause.
2) Fix inbound policy checks for inter address family tunnels.
From Thomas Zeitlhofer.
3) Fix an old memory leak on aead algorithm usage.
From Ilan Tayari.
4) A recent patch fixed a possible NULL pointer dereference
but broke the vti6 input path.
Fix from Nicolas Dichtel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We saw sch_fq drops caused by the per flow limit of 100 packets and TCP
when dealing with large cwnd and bursts of retransmits.
Even after increasing the limit to 1000, and even after commit
10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time"),
we can still have these drops.
Under certain conditions, TCP can spend a considerable amount of
time queuing thousands of skbs in a single tcp_xmit_retransmit_queue()
invocation, incurring latency spikes and stalls of other softirq
handlers.
This patch implements TSQ for retransmits, limiting number of packets
and giving more chance for scheduling packets in both ways.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jiri Pirko reported an UBSAN warning happening in ip_idents_reserve()
[] UBSAN: Undefined behaviour in ./arch/x86/include/asm/atomic.h:156:11
[] signed integer overflow:
[] -2117905507 + -695755206 cannot be represented in type 'int'
Since we do not have uatomic_add_return() yet, use atomic_cmpxchg()
so that the arithmetics can be done using unsigned int.
Fixes: 04ca6973f7c1 ("ip: make IP identifiers less predictable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
skb->protocol to ETH_P_8021Q or ETH_P_8021AD.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In 93515d53b1
"net: move vlan pop/push functions into common code"
skb_vlan_pop was moved from its private location in openvswitch to
skbuff common code.
In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured
that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was
considered a no-op).
This validation was moved as is into the new common 'skb_vlan_pop'.
Alas, in its original location (openvswitch), there was a guarantee that
'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
condition made sense.
However there's no such guarantee in the generic 'skb_vlan_pop'.
For short packets received in rx path going through 'skb_vlan_pop',
this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in the non
hw-accel case) or to fail moving next tag into hw-accel tag.
Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
It is superfluous since inner '__skb_vlan_pop' already verifies there
are VLAN_ETH_HLEN writable bytes at the mac_header.
Note this presents a slight change to skb_vlan_pop() users:
In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now
returns an error, as opposed to previous "no-op" behavior.
Existing callers (e.g. tc act vlan, ovs) usually drop the packet if
'skb_vlan_pop' fails.
Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Pravin Shelar <pshelar@ovn.org>
Reviewed-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCA_VLAN_ACT_MODIFY allows one to change an existing tag.
It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If packet is vlan tagged, then the tag gets overwritten according to
user specified attributes.
For example, this allows user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This exports the functionality of extracting the tag from the payload,
without moving next vlan tag into hw accel tag.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a function to track the average RTT for a peer. Sources of RTT data
will be added in subsequent patches.
The RTT data will be useful in the future for determining resend timeouts
and for handling the slow-start part of the Rx protocol.
Also add a pair of tracepoints, one to log transmissions to elicit a
response for RTT purposes and one to log responses that contribute RTT
data.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Add a Tx-phase annotation for packet buffers to indicate that a buffer has
already been retransmitted. This will be used by future congestion
management. Re-retransmissions of a packet don't affect the congestion
window managment in the same way as initial retransmissions.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Don't store the rxrpc protocol header in sk_buffs on the transmit queue,
but rather generate it on the fly and pass it to kernel_sendmsg() as a
separate iov. This reduces the amount of storage required.
Note that the security header is still stored in the sk_buff as it may get
encrypted along with the data (and doesn't change with each transmission).
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Implement .stats_update() callback. The implementation
is generic and can be reused by other simple actions if
needed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Call into offloaded filters to update stats.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag.
Unlike U32 and flower cls_bpf already has some netlink
flags defined. Create a new attribute to be able to use
the same flag values as the above.
Unlike U32 and flower reject unknown flags.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds hardware offload capability to cls_bpf classifier,
similar to what have been done with U32 and flower.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is called from the packet input path, we get lock contention
if many cpus handle ipsec in parallel.
After recent rcu conversion it is safe to call __xfrm_state_lookup
without the spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Since commit 1625f4529957, vti6 is broken, all input packets are dropped
(LINUX_MIB_XFRMINNOSTATES is incremented).
XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 is set by vti6_rcv() before calling
xfrm6_rcv()/xfrm6_rcv_spi(), thus we cannot set to NULL that value in
xfrm6_rcv_spi().
A new function xfrm6_rcv_tnl() that enables to pass a value to
xfrm6_rcv_spi() is added, so that xfrm6_rcv() is not touched (this function
is used in several handlers).
CC: Alexey Kodanev <alexey.kodanev@oracle.com>
Fixes: 1625f4529957 ("net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_key")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
When I introduced the lastuse member I made a subtle error because it was
returned as an absolute value but that is meaningless to user-space as it
doesn't allow to see how old exactly an entry is. Let's make it similar to
how the bridge returns such values and make it relative to "now" (jiffies).
This allows us to show the actual age of the entries and is much more
useful (e.g. user-space daemons can age out entries, iproute2 can display
the lastuse properly).
Fixes: 43b9e1274060 ("net: ipmr/ip6mr: add support for keeping an entry age")
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".
BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.
BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.
The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.
In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.
While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.
In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.
Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.
When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.
Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).
BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale. Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.
Example performance results, to illustrate the difference between BBR
and CUBIC:
Resilience to random loss (e.g. from shallow buffers):
Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
Low latency with the bloated buffers common in today's last-mile links:
Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
buffer. Both fully utilize the bottleneck bandwidth, but BBR
achieves this with a median RTT 25x lower (43 ms instead of 1.09
secs).
Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.
Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:
https://groups.google.com/forum/#!forum/bbr-dev
NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
enabled, since pacing is integral to the BBR design and
implementation. BBR without pacing would not function properly, and
may incur unnecessary high packet loss rates.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.
This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.
We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.
With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.
For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.
This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specificy this callback, TCP continues to
use the previous default of 2.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Export tcp_mss_to_mtu(), so that congestion control modules can use
this to help calculate a pacing rate.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|