From e943789edbb1f9de71b129d9992489eb79ed341f Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Fri, 15 Feb 2013 21:38:08 +0100 Subject: mac80211: provide ieee80211_sta_eosp() The irqsafe version ieee80211_sta_eosp_irqsafe() exists, but drivers must not mix calls to any irqsafe/non-irqsafe function. Both ath9k and iwlwifi, the likely first users of this interface, use non-irqsafe RX/TX/TX status so must also use a non-irqsafe version of this function. Since no driver uses the _irqsafe() version, remove that. Signed-off-by: Johannes Berg --- Documentation/DocBook/80211.tmpl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/80211.tmpl b/Documentation/DocBook/80211.tmpl index 284ced7a228f..0f6a3edcd44b 100644 --- a/Documentation/DocBook/80211.tmpl +++ b/Documentation/DocBook/80211.tmpl @@ -437,7 +437,7 @@ !Finclude/net/mac80211.h ieee80211_get_buffered_bc !Finclude/net/mac80211.h ieee80211_beacon_get -!Finclude/net/mac80211.h ieee80211_sta_eosp_irqsafe +!Finclude/net/mac80211.h ieee80211_sta_eosp !Finclude/net/mac80211.h ieee80211_frame_release_type !Finclude/net/mac80211.h ieee80211_sta_ps_transition !Finclude/net/mac80211.h ieee80211_sta_ps_transition_ni -- cgit v1.2.3 From 6ba8a3b19e764b6a65e4030ab0999be50c291e6c Mon Sep 17 00:00:00 2001 From: Nandita Dukkipati Date: Mon, 11 Mar 2013 10:00:43 +0000 Subject: tcp: Tail loss probe (TLP) This patch series implements the Tail loss probe (TLP) algorithm described in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The first patch implements the basic algorithm. TLP's goal is to reduce tail latency of short transactions. It achieves this by converting retransmission timeouts (RTOs) occurring due to tail losses (losses at end of transactions) into fast recovery. TLP transmits one packet in two round-trips when a connection is in Open state and isn't receiving any ACKs. The transmitted packet, aka loss probe, can be either new or a retransmission. When there is tail loss, the ACK from a loss probe triggers FACK/early-retransmit based fast recovery, thus avoiding a costly RTO. In the absence of loss, there is no change in the connection state. PTO stands for probe timeout. It is a timer event indicating that an ACK is overdue and triggers a loss probe packet. The PTO value is set to max(2*SRTT, 10ms) and is adjusted to account for the delayed ACK timer when there is only one outstanding packet.

TLP Algorithm:

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.
  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit (ER) algorithm, such that ER and TLP together can in principle recover any N-degree of tail loss through fast recovery. TLP is controlled by the same sysctl as ER, the tcp_early_retrans sysctl:
  tcp_early_retrans==0; disables TLP and ER.
                  ==1; enables RFC5827 ER.
                  ==2; delayed ER.
                  ==3; TLP and delayed ER. [DEFAULT]
                  ==4; TLP only.
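As a quick illustration of the scheduling rule above, here is a minimal standalone C sketch of the PTO selection (a hypothetical helper working in milliseconds; the real implementation is tcp_schedule_loss_probe() in the tcp_output.c hunk below, which works on jiffies and the scaled srtt):

    /* Illustrative only: pick the probe timeout per the rules above. */
    static unsigned int tlp_pto_msecs(unsigned int srtt_ms, unsigned int rto_ms,
                                      unsigned int packets_out)
    {
            unsigned int pto = 2 * srtt_ms;                /* base: 2*SRTT */
            unsigned int delack = srtt_ms + srtt_ms / 2 + 200;

            if (packets_out == 1 && delack > pto)
                    pto = delack;   /* lone packet: cover the delayed ACK timer */
            if (pto < 10)
                    pto = 10;       /* 10ms floor */
            return pto < rto_ms ? pto : rto_ms;     /* PTO = min(PTO, RTO) */
    }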
The TLP patch series has been extensively tested on Google Web servers. It is most effective for short Web transactions, where it reduced RTOs by 15% and improved HTTP response time (average by 6%, 99th percentile by 10%). The transmitted probes account for <0.5% of the overall transmissions. Signed-off-by: Nandita Dukkipati Acked-by: Neal Cardwell Acked-by: Yuchung Cheng Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 8 ++- include/linux/tcp.h | 1 - include/net/inet_connection_sock.h | 5 +- include/net/tcp.h | 6 +- include/uapi/linux/snmp.h | 1 + net/ipv4/inet_diag.c | 4 +- net/ipv4/proc.c | 1 + net/ipv4/sysctl_net_ipv4.c | 4 +- net/ipv4/tcp_input.c | 24 ++++--- net/ipv4/tcp_ipv4.c | 4 +- net/ipv4/tcp_output.c | 128 +++++++++++++++++++++++++++++++-- net/ipv4/tcp_timer.c | 13 ++-- 12 files changed, 171 insertions(+), 28 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index dc2dc87d2557..1cae6c383e1b 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -190,7 +190,9 @@ tcp_early_retrans - INTEGER Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold for triggering fast retransmit when the amount of outstanding data is small and when no previously unsent data can be transmitted (such - that limited transmit could be used). + that limited transmit could be used). Also controls the use of + Tail loss probe (TLP) that converts RTOs occuring due to tail + losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01). Possible values: 0 disables ER 1 enables ER @@ -198,7 +200,9 @@ tcp_early_retrans - INTEGER by a fourth of RTT. This mitigates connection falsely recovers when network has a small degree of reordering (less than 3 packets). - Default: 2 + 3 enables delayed ER and TLP. + 4 enables TLP only. + Default: 3 tcp_ecn - INTEGER Control use of Explicit Congestion Notification (ECN) by TCP.
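For completeness, a small userspace sketch of selecting one of the modes documented above at runtime (hypothetical helper; the knob sits at the standard procfs location for net.ipv4 sysctls and writing it requires privilege):

    #include <stdio.h>

    /* Write one of the documented values (0..4) to tcp_early_retrans. */
    static int set_tcp_early_retrans(int mode)
    {
            FILE *f = fopen("/proc/sys/net/ipv4/tcp_early_retrans", "w");

            if (!f)
                    return -1;
            fprintf(f, "%d\n", mode);
            return fclose(f);
    }

    /* e.g. set_tcp_early_retrans(3); keeps the default: delayed ER + TLP */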
diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 515c3746b675..01860d74555c 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -201,7 +201,6 @@ struct tcp_sock { unused : 1; u8 repair_queue; u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */ - early_retrans_delayed:1, /* Delayed ER timer installed */ syn_data:1, /* SYN includes data */ syn_fastopen:1, /* SYN includes Fast Open option */ syn_data_acked:1;/* data in SYN is acked by SYN-ACK */ diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 183292722f6e..de2c78529afa 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -133,6 +133,8 @@ struct inet_connection_sock { #define ICSK_TIME_RETRANS 1 /* Retransmit timer */ #define ICSK_TIME_DACK 2 /* Delayed ack timer */ #define ICSK_TIME_PROBE0 3 /* Zero window probe timer */ +#define ICSK_TIME_EARLY_RETRANS 4 /* Early retransmit timer */ +#define ICSK_TIME_LOSS_PROBE 5 /* Tail loss probe timer */ static inline struct inet_connection_sock *inet_csk(const struct sock *sk) { @@ -222,7 +224,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what, when = max_when; } - if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) { + if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 || + what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE) { icsk->icsk_pending = what; icsk->icsk_timeout = jiffies + when; sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout); diff --git a/include/net/tcp.h b/include/net/tcp.h index a2baa5e4ba31..ab9f947b118b 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -543,6 +543,8 @@ extern bool tcp_syn_flood_action(struct sock *sk, extern void tcp_push_one(struct sock *, unsigned int mss_now); extern void tcp_send_ack(struct sock *sk); extern void tcp_send_delayed_ack(struct sock *sk); +extern void tcp_send_loss_probe(struct sock *sk); +extern bool tcp_schedule_loss_probe(struct sock *sk); /* tcp_input.c */ extern void tcp_cwnd_application_limited(struct sock *sk); @@ -873,8 +875,8 @@ static inline void tcp_enable_fack(struct tcp_sock *tp) static inline void tcp_enable_early_retrans(struct tcp_sock *tp) { tp->do_early_retrans = sysctl_tcp_early_retrans && - !sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3; - tp->early_retrans_delayed = 0; + sysctl_tcp_early_retrans < 4 && !sysctl_tcp_thin_dupack && + sysctl_tcp_reordering == 3; } static inline void tcp_disable_early_retrans(struct tcp_sock *tp) diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h index b49eab89c9fd..290bed6b085f 100644 --- a/include/uapi/linux/snmp.h +++ b/include/uapi/linux/snmp.h @@ -202,6 +202,7 @@ enum LINUX_MIB_TCPFORWARDRETRANS, /* TCPForwardRetrans */ LINUX_MIB_TCPSLOWSTARTRETRANS, /* TCPSlowStartRetrans */ LINUX_MIB_TCPTIMEOUTS, /* TCPTimeouts */ + LINUX_MIB_TCPLOSSPROBES, /* TCPLossProbes */ LINUX_MIB_TCPRENORECOVERYFAIL, /* TCPRenoRecoveryFail */ LINUX_MIB_TCPSACKRECOVERYFAIL, /* TCPSackRecoveryFail */ LINUX_MIB_TCPSCHEDULERFAILED, /* TCPSchedulerFailed */ diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index 7afa2c3c788f..8620408af574 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -158,7 +158,9 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk, #define EXPIRES_IN_MS(tmo) DIV_ROUND_UP((tmo - jiffies) * 1000, HZ) - if (icsk->icsk_pending == ICSK_TIME_RETRANS) { + if (icsk->icsk_pending == ICSK_TIME_RETRANS || + icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || 
+ icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { r->idiag_timer = 1; r->idiag_retrans = icsk->icsk_retransmits; r->idiag_expires = EXPIRES_IN_MS(icsk->icsk_timeout); diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 32030a24e776..4c35911d935f 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -224,6 +224,7 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM("TCPForwardRetrans", LINUX_MIB_TCPFORWARDRETRANS), SNMP_MIB_ITEM("TCPSlowStartRetrans", LINUX_MIB_TCPSLOWSTARTRETRANS), SNMP_MIB_ITEM("TCPTimeouts", LINUX_MIB_TCPTIMEOUTS), + SNMP_MIB_ITEM("TCPLossProbes", LINUX_MIB_TCPLOSSPROBES), SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL), SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL), SNMP_MIB_ITEM("TCPSchedulerFailed", LINUX_MIB_TCPSCHEDULERFAILED), diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 960fd29d9b8e..cca4550f4082 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -28,7 +28,7 @@ static int zero; static int one = 1; -static int two = 2; +static int four = 4; static int tcp_retr1_max = 255; static int ip_local_port_range_min[] = { 1, 1 }; static int ip_local_port_range_max[] = { 65535, 65535 }; @@ -760,7 +760,7 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = &zero, - .extra2 = &two, + .extra2 = &four, }, { .procname = "udp_mem", diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0d9bdacce99f..b794f89ac1f2 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -98,7 +98,7 @@ int sysctl_tcp_frto_response __read_mostly; int sysctl_tcp_thin_dupack __read_mostly; int sysctl_tcp_moderate_rcvbuf __read_mostly = 1; -int sysctl_tcp_early_retrans __read_mostly = 2; +int sysctl_tcp_early_retrans __read_mostly = 3; #define FLAG_DATA 0x01 /* Incoming frame contained data. */ #define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */ @@ -2150,15 +2150,16 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag) * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples * available, or RTO is scheduled to fire first. */ - if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt) + if (sysctl_tcp_early_retrans < 2 || sysctl_tcp_early_retrans > 3 || + (flag & FLAG_ECE) || !tp->srtt) return false; delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2)); if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay))) return false; - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX); - tp->early_retrans_delayed = 1; + inet_csk_reset_xmit_timer(sk, ICSK_TIME_EARLY_RETRANS, delay, + TCP_RTO_MAX); return true; } @@ -2321,7 +2322,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag) * interval if appropriate. 
*/ if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out && - (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) && + (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) && !tcp_may_send_now(sk)) return !tcp_pause_early_retransmit(sk, flag); @@ -3081,6 +3082,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight) */ void tcp_rearm_rto(struct sock *sk) { + const struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); /* If the retrans timer is currently being used by Fast Open @@ -3094,12 +3096,13 @@ void tcp_rearm_rto(struct sock *sk) } else { u32 rto = inet_csk(sk)->icsk_rto; /* Offset the time elapsed after installing regular RTO */ - if (tp->early_retrans_delayed) { + if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { struct sk_buff *skb = tcp_write_queue_head(sk); const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto; s32 delta = (s32)(rto_time_stamp - tcp_time_stamp); /* delta may not be positive if the socket is locked - * when the delayed ER timer fires and is rescheduled. + * when the retrans timer fires and is rescheduled. */ if (delta > 0) rto = delta; @@ -3107,7 +3110,6 @@ void tcp_rearm_rto(struct sock *sk) inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX); } - tp->early_retrans_delayed = 0; } /* This function is called when the delayed ER timer fires. TCP enters @@ -3601,7 +3603,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if (after(ack, tp->snd_nxt)) goto invalid_ack; - if (tp->early_retrans_delayed) + if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) tcp_rearm_rto(sk); if (after(ack, prior_snd_una)) @@ -3678,6 +3681,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if (dst) dst_confirm(dst); } + + if (icsk->icsk_pending == ICSK_TIME_RETRANS) + tcp_schedule_loss_probe(sk); return 1; no_queue: diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 8cdee120a50c..b7ab868c8284 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2703,7 +2703,9 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len) __u16 srcp = ntohs(inet->inet_sport); int rx_queue; - if (icsk->icsk_pending == ICSK_TIME_RETRANS) { + if (icsk->icsk_pending == ICSK_TIME_RETRANS || + icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { timer_active = 1; timer_expires = icsk->icsk_timeout; } else if (icsk->icsk_pending == ICSK_TIME_PROBE0) { diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e2b4461074da..beb63dbc85f5 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -74,6 +74,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, /* Account for new data that has been sent to the network. 
*/ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb) { + struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); unsigned int prior_packets = tp->packets_out; @@ -85,7 +86,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb) tp->frto_counter = 3; tp->packets_out += tcp_skb_pcount(skb); - if (!prior_packets || tp->early_retrans_delayed) + if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) tcp_rearm_rto(sk); } @@ -1959,6 +1961,9 @@ static int tcp_mtu_probe(struct sock *sk) * snd_up-64k-mss .. snd_up cannot be large. However, taking into * account rare use of URG, this is not a big flaw. * + * Send at most one packet when push_one > 0. Temporarily ignore + * cwnd limit to force at most one packet out when push_one == 2. + * Returns true, if no segments are in flight and we have queued segments, * but cannot send anything now because of SWS or another problem. */ @@ -1994,8 +1999,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, goto repair; /* Skip network transmission */ cwnd_quota = tcp_cwnd_test(tp, skb); - if (!cwnd_quota) - break; + if (!cwnd_quota) { + if (push_one == 2) + /* Force out a loss probe pkt. */ + cwnd_quota = 1; + else + break; + } if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) break; @@ -2049,10 +2059,120 @@ repair: if (likely(sent_pkts)) { if (tcp_in_cwnd_reduction(sk)) tp->prr_out += sent_pkts; + + /* Send one loss probe per tail loss episode. */ + if (push_one != 2) + tcp_schedule_loss_probe(sk); tcp_cwnd_validate(sk); return false; } - return !tp->packets_out && tcp_send_head(sk); + return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk)); +} + +bool tcp_schedule_loss_probe(struct sock *sk) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + struct tcp_sock *tp = tcp_sk(sk); + u32 timeout, tlp_time_stamp, rto_time_stamp; + u32 rtt = tp->srtt >> 3; + + if (WARN_ON(icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS)) + return false; + /* No consecutive loss probes. */ + if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) { + tcp_rearm_rto(sk); + return false; + } + /* Don't do any loss probe on a Fast Open connection before 3WHS + * finishes. + */ + if (sk->sk_state == TCP_SYN_RECV) + return false; + + /* TLP is only scheduled when next timer event is RTO. */ + if (icsk->icsk_pending != ICSK_TIME_RETRANS) + return false; + + /* Schedule a loss probe in 2*RTT for SACK capable connections + * in Open state, that are either limited by cwnd or application. + */ + if (sysctl_tcp_early_retrans < 3 || !rtt || !tp->packets_out || + !tcp_is_sack(tp) || inet_csk(sk)->icsk_ca_state != TCP_CA_Open) + return false; + + if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) && + tcp_send_head(sk)) + return false; + + /* Probe timeout is at least 1.5*rtt + TCP_DELACK_MAX to account + * for delayed ack when there's one outstanding packet. + */ + timeout = rtt << 1; + if (tp->packets_out == 1) + timeout = max_t(u32, timeout, + (rtt + (rtt >> 1) + TCP_DELACK_MAX)); + timeout = max_t(u32, timeout, msecs_to_jiffies(10)); + + /* If RTO is shorter, just schedule TLP in its place. 
*/ + tlp_time_stamp = tcp_time_stamp + timeout; + rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout; + if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) { + s32 delta = rto_time_stamp - tcp_time_stamp; + if (delta > 0) + timeout = delta; + } + + inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout, + TCP_RTO_MAX); + return true; +} + +/* When probe timeout (PTO) fires, send a new segment if one exists, else + * retransmit the last segment. + */ +void tcp_send_loss_probe(struct sock *sk) +{ + struct sk_buff *skb; + int pcount; + int mss = tcp_current_mss(sk); + int err = -1; + + if (tcp_send_head(sk) != NULL) { + err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC); + goto rearm_timer; + } + + /* Retransmit last segment. */ + skb = tcp_write_queue_tail(sk); + if (WARN_ON(!skb)) + goto rearm_timer; + + pcount = tcp_skb_pcount(skb); + if (WARN_ON(!pcount)) + goto rearm_timer; + + if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) { + if (unlikely(tcp_fragment(sk, skb, (pcount - 1) * mss, mss))) + goto rearm_timer; + skb = tcp_write_queue_tail(sk); + } + + if (WARN_ON(!skb || !tcp_skb_pcount(skb))) + goto rearm_timer; + + /* Probe with zero data doesn't trigger fast recovery. */ + if (skb->len > 0) + err = __tcp_retransmit_skb(sk, skb); + +rearm_timer: + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, + inet_csk(sk)->icsk_rto, + TCP_RTO_MAX); + + if (likely(!err)) + NET_INC_STATS_BH(sock_net(sk), + LINUX_MIB_TCPLOSSPROBES); + return; } /* Push out any pending frames which were held back due to diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index b78aac30c498..ecd61d54147f 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -342,10 +342,6 @@ void tcp_retransmit_timer(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); - if (tp->early_retrans_delayed) { - tcp_resume_early_retransmit(sk); - return; - } if (tp->fastopen_rsk) { WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV && sk->sk_state != TCP_FIN_WAIT1); @@ -495,13 +491,20 @@ void tcp_write_timer_handler(struct sock *sk) } event = icsk->icsk_pending; - icsk->icsk_pending = 0; switch (event) { + case ICSK_TIME_EARLY_RETRANS: + tcp_resume_early_retransmit(sk); + break; + case ICSK_TIME_LOSS_PROBE: + tcp_send_loss_probe(sk); + break; case ICSK_TIME_RETRANS: + icsk->icsk_pending = 0; tcp_retransmit_timer(sk); break; case ICSK_TIME_PROBE0: + icsk->icsk_pending = 0; tcp_probe_timer(sk); break; } -- cgit v1.2.3 From 470d147428563aba9c2eb7c019383335249c6110 Mon Sep 17 00:00:00 2001 From: Mugunthan V N Date: Mon, 11 Mar 2013 23:16:34 +0000 Subject: documentation: dt: bindings: cpsw: cleanup documentation Move all the slave node properties to a separate section to reduce the confusion between slave node properties and cpsw node properties Signed-off-by: Mugunthan V N Signed-off-by: David S.
Miller --- Documentation/devicetree/bindings/net/cpsw.txt | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt index ecfdf756d10f..8e49c4200928 100644 --- a/Documentation/devicetree/bindings/net/cpsw.txt +++ b/Documentation/devicetree/bindings/net/cpsw.txt @@ -18,13 +18,18 @@ Required properties: - cpts_active_slave : Specifies the slave to use for time stamping - cpts_clock_mult : Numerator to convert input clock ticks into nanoseconds - cpts_clock_shift : Denominator to convert input clock ticks into nanoseconds -- phy_id : Specifies slave phy id -- mac-address : Specifies slave MAC address Optional properties: - ti,hwmods : Must be "cpgmac0" - no_bd_ram : Must be 0 or 1 - dual_emac : Specifies Switch to act as Dual EMAC + +Slave Properties: +Required properties: +- phy_id : Specifies slave phy id +- mac-address : Specifies slave MAC address + +Optional properties: - dual_emac_res_vlan : Specifies VID to be used to segregate the ports Note: "ti,hwmods" field is used to fetch the base address and irq -- cgit v1.2.3 From e86ac13b031cf71d8f40ff513e627aac80e6b765 Mon Sep 17 00:00:00 2001 From: Mugunthan V N Date: Mon, 11 Mar 2013 23:16:35 +0000 Subject: drivers: net: ethernet: cpsw: change cpts_active_slave to active_slave Change cpts_active_slave to active_slave so that the same DT property can be used for ethtool and SIOCGMIIPHY. CC: Richard Cochran Signed-off-by: Mugunthan V N Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/cpsw.txt | 7 ++++--- arch/arm/boot/dts/am33xx.dtsi | 2 +- drivers/net/ethernet/ti/cpsw.c | 10 +++++----- include/linux/platform_data/cpsw.h | 2 +- 4 files changed, 11 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt index 8e49c4200928..4f2ca6b4a182 100644 --- a/Documentation/devicetree/bindings/net/cpsw.txt +++ b/Documentation/devicetree/bindings/net/cpsw.txt @@ -15,7 +15,8 @@ Required properties: - mac_control : Specifies Default MAC control register content for the specific platform - slaves : Specifies number for slaves -- cpts_active_slave : Specifies the slave to use for time stamping +- active_slave : Specifies the slave to use for time stamping, + ethtool and SIOCGMIIPHY - cpts_clock_mult : Numerator to convert input clock ticks into nanoseconds - cpts_clock_shift : Denominator to convert input clock ticks into nanoseconds @@ -52,7 +53,7 @@ Examples: rx_descs = <64>; mac_control = <0x20>; slaves = <2>; - cpts_active_slave = <0>; + active_slave = <0>; cpts_clock_mult = <0x80000000>; cpts_clock_shift = <29>; cpsw_emac0: slave@0 { @@ -78,7 +79,7 @@ Examples: rx_descs = <64>; mac_control = <0x20>; slaves = <2>; - cpts_active_slave = <0>; + active_slave = <0>; cpts_clock_mult = <0x80000000>; cpts_clock_shift = <29>; cpsw_emac0: slave@0 { diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi index 0957645b73af..91fe4f148f80 100644 --- a/arch/arm/boot/dts/am33xx.dtsi +++ b/arch/arm/boot/dts/am33xx.dtsi @@ -349,7 +349,7 @@ rx_descs = <64>; mac_control = <0x20>; slaves = <2>; - cpts_active_slave = <0>; + active_slave = <0>; cpts_clock_mult = <0x80000000>; cpts_clock_shift = <29>; reg = <0x4a100000 0x800 diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index 01ffbc486982..98aa17a9516a 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++
b/drivers/net/ethernet/ti/cpsw.c @@ -942,7 +942,7 @@ static void cpsw_ndo_change_rx_flags(struct net_device *ndev, int flags) static void cpsw_hwtstamp_v1(struct cpsw_priv *priv) { - struct cpsw_slave *slave = &priv->slaves[priv->data.cpts_active_slave]; + struct cpsw_slave *slave = &priv->slaves[priv->data.active_slave]; u32 ts_en, seq_id; if (!priv->cpts->tx_enable && !priv->cpts->rx_enable) { @@ -971,7 +971,7 @@ static void cpsw_hwtstamp_v2(struct cpsw_priv *priv) if (priv->data.dual_emac) slave = &priv->slaves[priv->emac_port]; else - slave = &priv->slaves[priv->data.cpts_active_slave]; + slave = &priv->slaves[priv->data.active_slave]; ctrl = slave_read(slave, CPSW2_CONTROL); ctrl &= ~CTRL_ALL_TS_MASK; @@ -1282,12 +1282,12 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data, } data->slaves = prop; - if (of_property_read_u32(node, "cpts_active_slave", &prop)) { - pr_err("Missing cpts_active_slave property in the DT.\n"); + if (of_property_read_u32(node, "active_slave", &prop)) { + pr_err("Missing active_slave property in the DT.\n"); ret = -EINVAL; goto error_ret; } - data->cpts_active_slave = prop; + data->active_slave = prop; if (of_property_read_u32(node, "cpts_clock_mult", &prop)) { pr_err("Missing cpts_clock_mult property in the DT.\n"); diff --git a/include/linux/platform_data/cpsw.h b/include/linux/platform_data/cpsw.h index 798fb80b024b..bb3cd58d71e3 100644 --- a/include/linux/platform_data/cpsw.h +++ b/include/linux/platform_data/cpsw.h @@ -30,7 +30,7 @@ struct cpsw_platform_data { u32 channels; /* number of cpdma channels (symmetric) */ u32 slaves; /* number of slave cpgmac ports */ struct cpsw_slave_data *slave_data; - u32 cpts_active_slave; /* time stamping slave */ + u32 active_slave; /* time stamping, ethtool and SIOCGMIIPHY slave */ u32 cpts_clock_mult; /* convert input clock ticks to nanoseconds */ u32 cpts_clock_shift; /* convert input clock ticks to nanoseconds */ u32 ale_entries; /* ale table size */ -- cgit v1.2.3 From d353d8d4d9f0184ac43a90c6e04b593c33bd28ea Mon Sep 17 00:00:00 2001 From: Martin Hundebøll Date: Fri, 25 Jan 2013 11:12:38 +0100 Subject: batman-adv: network coding - add the initial infrastructure code MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Network coding exploits the 802.11 shared medium to allow multiple packets to be sent in a single transmission. In brief, a relay can XOR two packets, and send the coded packet to two destinations. The receivers can decode one of the original packets by XOR'ing the coded packet with the other original packet. This will lead to increased throughput in topologies where two packets cross one relay. In a simple topology with three nodes, it takes four transmissions without network coding to get one packet from Node A to Node B and one from Node B to Node A:

 1. Node A ---- p1 ---> Node R              Node B
 2. Node A              Node R <--- p2 ---- Node B
 3. Node A <--- p2 ---- Node R              Node B
 4. Node A              Node R ---- p1 ---> Node B

With network coding, the relay only needs one transmission, which saves us one slot of valuable airtime:

 1. Node A ---- p1 ---> Node R              Node B
 2. Node A              Node R <--- p2 ---- Node B
 3. Node A <- p1 x p2 - Node R - p1 x p2 -> Node B

The same principle holds for a topology including five nodes. Here the packets from Node A and Node B are overheard by Node C and Node D, respectively.
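(A minimal userspace illustration of the XOR principle described above, before the five-node diagram below — a hypothetical sketch, not batman-adv code: the relay transmits p1 x p2 once, and each receiver recovers the unknown packet by XOR'ing the coded packet with the one it already holds.)

    #include <stddef.h>
    #include <stdint.h>

    /* Relay side: fuse two equal-length packets into one transmission. */
    static void nc_xor(uint8_t *out, const uint8_t *p1,
                       const uint8_t *p2, size_t len)
    {
            size_t i;

            for (i = 0; i < len; i++)
                    out[i] = p1[i] ^ p2[i];
    }

    /* Receiver side: XOR is its own inverse, so decoding reuses nc_xor();
     * e.g. nc_xor(p2, coded, p1, len) recovers p2 at Node A. */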
This allows Node R to send a network coded packet to save one transmission:

     Node A                Node B
       |  \              /  |
       |   p1          p2   |
       |    \          /    |
      p1 >    Node R    < p2
       |   |    / \    |   |
       | p1 x p2  p1 x p2  |
       v  /            \  v
        /                \
     Node C  <        >  Node D

More information is available on the open-mesh.org wiki[1]. This patch adds the initial code to support network coding in batman-adv. It sets up a worker thread to do housekeeping and adds a sysfs file to enable/disable network coding. The feature is disabled by default, as it requires a wifi driver with working promiscuous mode, and also because it adds a small delay at each hop. [1] http://www.open-mesh.org/projects/batman-adv/wiki/Catwoman Signed-off-by: Martin Hundebøll Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- Documentation/ABI/testing/sysfs-class-net-mesh | 8 +++ net/batman-adv/Kconfig | 14 +++++ net/batman-adv/Makefile | 1 + net/batman-adv/main.c | 6 ++ net/batman-adv/main.h | 4 +- net/batman-adv/network-coding.c | 81 ++++++++++++++++++++++++++ net/batman-adv/network-coding.h | 48 +++++++++++++++ net/batman-adv/soft-interface.c | 3 + net/batman-adv/sysfs.c | 6 ++ net/batman-adv/types.h | 14 +++++ 10 files changed, 184 insertions(+), 1 deletion(-) create mode 100644 net/batman-adv/network-coding.c create mode 100644 net/batman-adv/network-coding.h (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-class-net-mesh b/Documentation/ABI/testing/sysfs-class-net-mesh index bc41da61608d..bdcd8b4e38f2 100644 --- a/Documentation/ABI/testing/sysfs-class-net-mesh +++ b/Documentation/ABI/testing/sysfs-class-net-mesh @@ -67,6 +67,14 @@ Description: Defines the penalty which will be applied to an originator message's tq-field on every hop. +What: /sys/class/net//mesh/network_coding +Date: Nov 2012 +Contact: Martin Hundeboll +Description: + Controls whether Network Coding (using some magic + to send fewer wifi packets but still the same + content) is enabled or not. + What: /sys/class/net//mesh/orig_interval Date: May 2010 Contact: Marek Lindner diff --git a/net/batman-adv/Kconfig b/net/batman-adv/Kconfig index 8d8afb134b3a..fa780b76630e 100644 --- a/net/batman-adv/Kconfig +++ b/net/batman-adv/Kconfig @@ -36,6 +36,20 @@ config BATMAN_ADV_DAT mesh networks. If you think that your network does not need this option you can safely remove it and save some space. +config BATMAN_ADV_NC + bool "Network Coding" + depends on BATMAN_ADV + default n + help + This option enables network coding, a mechanism that aims to + increase the overall network throughput by fusing multiple + packets in one transmission. + Note that interfaces controlled by batman-adv must be manually + configured to have promiscuous mode enabled in order to make + network coding work. + If you think that your network does not need this feature you + can safely disable it and save some space. + config BATMAN_ADV_DEBUG bool "B.A.T.M.A.N.
debugging" depends on BATMAN_ADV diff --git a/net/batman-adv/Makefile b/net/batman-adv/Makefile index e45e3b4e32e3..4b8f192a9e43 100644 --- a/net/batman-adv/Makefile +++ b/net/batman-adv/Makefile @@ -30,6 +30,7 @@ batman-adv-y += hard-interface.o batman-adv-y += hash.o batman-adv-y += icmp_socket.o batman-adv-y += main.o +batman-adv-$(CONFIG_BATMAN_ADV_NC) += network-coding.o batman-adv-y += originator.o batman-adv-y += ring_buffer.o batman-adv-y += routing.o diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c index 0488d70c8c35..0495a7dc7505 100644 --- a/net/batman-adv/main.c +++ b/net/batman-adv/main.c @@ -35,6 +35,7 @@ #include "vis.h" #include "hash.h" #include "bat_algo.h" +#include "network-coding.h" /* List manipulations on hardif_list have to be rtnl_lock()'ed, @@ -135,6 +136,10 @@ int batadv_mesh_init(struct net_device *soft_iface) if (ret < 0) goto err; + ret = batadv_nc_init(bat_priv); + if (ret < 0) + goto err; + atomic_set(&bat_priv->gw.reselect, 0); atomic_set(&bat_priv->mesh_state, BATADV_MESH_ACTIVE); @@ -157,6 +162,7 @@ void batadv_mesh_free(struct net_device *soft_iface) batadv_gw_node_purge(bat_priv); batadv_originator_free(bat_priv); + batadv_nc_free(bat_priv); batadv_tt_free(bat_priv); diff --git a/net/batman-adv/main.h b/net/batman-adv/main.h index ced08b936a96..59ba2ff8e252 100644 --- a/net/batman-adv/main.h +++ b/net/batman-adv/main.h @@ -185,6 +185,7 @@ __be32 batadv_skb_crc32(struct sk_buff *skb, u8 *payload_ptr); * @BATADV_DBG_TT: translation table messages * @BATADV_DBG_BLA: bridge loop avoidance messages * @BATADV_DBG_DAT: ARP snooping and DAT related messages + * @BATADV_DBG_NC: network coding related messages * @BATADV_DBG_ALL: the union of all the above log levels */ enum batadv_dbg_level { @@ -193,7 +194,8 @@ enum batadv_dbg_level { BATADV_DBG_TT = BIT(2), BATADV_DBG_BLA = BIT(3), BATADV_DBG_DAT = BIT(4), - BATADV_DBG_ALL = 31, + BATADV_DBG_NC = BIT(5), + BATADV_DBG_ALL = 63, }; #ifdef CONFIG_BATMAN_ADV_DEBUG diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c new file mode 100644 index 000000000000..9c2d54bdd2ce --- /dev/null +++ b/net/batman-adv/network-coding.c @@ -0,0 +1,81 @@ +/* Copyright (C) 2012-2013 B.A.T.M.A.N. contributors: + * + * Martin Hundebøll, Jeppe Ledet-Pedersen + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#include "main.h" +#include "network-coding.h" + +static void batadv_nc_worker(struct work_struct *work); + +/** + * batadv_nc_start_timer - initialise the nc periodic worker + * @bat_priv: the bat priv with all the soft interface information + */ +static void batadv_nc_start_timer(struct batadv_priv *bat_priv) +{ + queue_delayed_work(batadv_event_workqueue, &bat_priv->nc.work, + msecs_to_jiffies(10)); +} + +/** + * batadv_nc_init - initialise coding hash table and start house keeping + * @bat_priv: the bat priv with all the soft interface information + */ +int batadv_nc_init(struct batadv_priv *bat_priv) +{ + INIT_DELAYED_WORK(&bat_priv->nc.work, batadv_nc_worker); + batadv_nc_start_timer(bat_priv); + + return 0; +} + +/** + * batadv_nc_init_bat_priv - initialise the nc specific bat_priv variables + * @bat_priv: the bat priv with all the soft interface information + */ +void batadv_nc_init_bat_priv(struct batadv_priv *bat_priv) +{ + atomic_set(&bat_priv->network_coding, 1); +} + +/** + * batadv_nc_worker - periodic task for house keeping related to network coding + * @work: kernel work struct + */ +static void batadv_nc_worker(struct work_struct *work) +{ + struct delayed_work *delayed_work; + struct batadv_priv_nc *priv_nc; + struct batadv_priv *bat_priv; + + delayed_work = container_of(work, struct delayed_work, work); + priv_nc = container_of(delayed_work, struct batadv_priv_nc, work); + bat_priv = container_of(priv_nc, struct batadv_priv, nc); + + /* Schedule a new check */ + batadv_nc_start_timer(bat_priv); +} + +/** + * batadv_nc_free - clean up network coding memory + * @bat_priv: the bat priv with all the soft interface information + */ +void batadv_nc_free(struct batadv_priv *bat_priv) +{ + cancel_delayed_work_sync(&bat_priv->nc.work); +} diff --git a/net/batman-adv/network-coding.h b/net/batman-adv/network-coding.h new file mode 100644 index 000000000000..7483cba4b5d4 --- /dev/null +++ b/net/batman-adv/network-coding.h @@ -0,0 +1,48 @@ +/* Copyright (C) 2012-2013 B.A.T.M.A.N. contributors: + * + * Martin Hundebøll, Jeppe Ledet-Pedersen + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#ifndef _NET_BATMAN_ADV_NETWORK_CODING_H_ +#define _NET_BATMAN_ADV_NETWORK_CODING_H_ + +#ifdef CONFIG_BATMAN_ADV_NC + +int batadv_nc_init(struct batadv_priv *bat_priv); +void batadv_nc_free(struct batadv_priv *bat_priv); +void batadv_nc_init_bat_priv(struct batadv_priv *bat_priv); + +#else /* ifdef CONFIG_BATMAN_ADV_NC */ + +static inline int batadv_nc_init(struct batadv_priv *bat_priv) +{ + return 0; +} + +static inline void batadv_nc_free(struct batadv_priv *bat_priv) +{ + return; +} + +static inline void batadv_nc_init_bat_priv(struct batadv_priv *bat_priv) +{ + return; +} + +#endif /* ifdef CONFIG_BATMAN_ADV_NC */ + +#endif /* _NET_BATMAN_ADV_NETWORK_CODING_H_ */ diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index 2711e870f557..7188e07dfc6f 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -37,6 +37,7 @@ #include #include "unicast.h" #include "bridge_loop_avoidance.h" +#include "network-coding.h" static int batadv_get_settings(struct net_device *dev, struct ethtool_cmd *cmd); @@ -544,6 +545,8 @@ struct net_device *batadv_softif_create(const char *name) if (ret < 0) goto unreg_soft_iface; + batadv_nc_init_bat_priv(bat_priv); + ret = batadv_sysfs_add_meshif(soft_iface); if (ret < 0) goto unreg_soft_iface; diff --git a/net/batman-adv/sysfs.c b/net/batman-adv/sysfs.c index 6a44fed12837..ce39f62f751e 100644 --- a/net/batman-adv/sysfs.c +++ b/net/batman-adv/sysfs.c @@ -442,6 +442,9 @@ static BATADV_ATTR(gw_bandwidth, S_IRUGO | S_IWUSR, batadv_show_gw_bwidth, #ifdef CONFIG_BATMAN_ADV_DEBUG BATADV_ATTR_SIF_UINT(log_level, S_IRUGO | S_IWUSR, 0, BATADV_DBG_ALL, NULL); #endif +#ifdef CONFIG_BATMAN_ADV_NC +BATADV_ATTR_SIF_BOOL(network_coding, S_IRUGO | S_IWUSR, NULL); +#endif static struct batadv_attribute *batadv_mesh_attrs[] = { &batadv_attr_aggregated_ogms, @@ -463,6 +466,9 @@ static struct batadv_attribute *batadv_mesh_attrs[] = { &batadv_attr_gw_bandwidth, #ifdef CONFIG_BATMAN_ADV_DEBUG &batadv_attr_log_level, +#endif +#ifdef CONFIG_BATMAN_ADV_NC + &batadv_attr_network_coding, #endif NULL, }; diff --git a/net/batman-adv/types.h b/net/batman-adv/types.h index 4cd87a0b5b80..83bfe7c38f81 100644 --- a/net/batman-adv/types.h +++ b/net/batman-adv/types.h @@ -427,6 +427,14 @@ struct batadv_priv_dat { }; #endif +/** + * struct batadv_priv_nc - per mesh interface network coding private data + * @work: work queue callback item for cleanup + */ +struct batadv_priv_nc { + struct delayed_work work; +}; + /** * struct batadv_priv - per mesh interface data * @mesh_state: current status of the mesh (inactive/active/deactivating) @@ -470,6 +478,8 @@ struct batadv_priv_dat { * @tt: translation table data * @vis: vis data * @dat: distributed arp table data + * @network_coding: bool indicating whether network coding is enabled + * @batadv_priv_nc: network coding data */ struct batadv_priv { atomic_t mesh_state; @@ -522,6 +532,10 @@ struct batadv_priv { #ifdef CONFIG_BATMAN_ADV_DAT struct batadv_priv_dat dat; #endif +#ifdef CONFIG_BATMAN_ADV_NC + atomic_t network_coding; + struct batadv_priv_nc nc; +#endif /* CONFIG_BATMAN_ADV_NC */ }; /** -- cgit v1.2.3 From b66c66dc5cc8f8f8d68ea1177b9672f91e1e7a19 Mon Sep 17 00:00:00 2001 From: Li RongQing Date: Thu, 14 Mar 2013 22:49:47 +0000 Subject: Documentation: fix 
neigh/default/gc_thresh1 default value. The default value is 128, not 256:

  #grep gc_thresh1 net/ -rI
  net/decnet/dn_neigh.c: .gc_thresh1 = 128,
  net/ipv6/ndisc.c: .gc_thresh1 = 128,
  net/ipv4/arp.c: .gc_thresh1 = 128,

Signed-off-by: Li RongQing Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 1cae6c383e1b..18a24c405ac0 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -29,7 +29,7 @@ route/max_size - INTEGER neigh/default/gc_thresh1 - INTEGER Minimum number of entries to keep. Garbage collector will not purge entries if there are fewer than this number. - Default: 256 + Default: 128 neigh/default/gc_thresh3 - INTEGER Maximum number of neighbor entries allowed. Increase this -- cgit v1.2.3 From 1a2c6181c4a1922021b4d7df373bba612c3e5f04 Mon Sep 17 00:00:00 2001 From: Christoph Paasch Date: Sun, 17 Mar 2013 08:23:34 +0000 Subject: tcp: Remove TCPCT TCPCT uses option number 253, reserved for experimental use, and should not be used in production environments. Further, TCPCT does not fully implement RFC 6013. As a nice side-effect, removing TCPCT increases TCP's performance for very short flows: Doing an apache-benchmark with -c 100 -n 100000, sending HTTP requests for files of 1KB size:

  before this patch: average (among 7 runs) of 20845.5 Requests/Second
  after: average (among 7 runs) of 21403.6 Requests/Second

Signed-off-by: Christoph Paasch Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 8 - drivers/infiniband/hw/cxgb4/cm.c | 2 +- include/linux/tcp.h | 10 -- include/net/request_sock.h | 8 +- include/net/tcp.h | 89 +---------- include/uapi/linux/tcp.h | 26 ---- net/dccp/ipv4.c | 5 +- net/dccp/ipv6.c | 5 +- net/ipv4/inet_connection_sock.c | 2 +- net/ipv4/syncookies.c | 3 +- net/ipv4/sysctl_net_ipv4.c | 7 - net/ipv4/tcp.c | 267 --------------------------- net/ipv4/tcp_input.c | 69 +-------- net/ipv4/tcp_ipv4.c | 60 +------- net/ipv4/tcp_minisocks.c | 40 +---- net/ipv4/tcp_output.c | 219 +-------------------- net/ipv6/syncookies.c | 3 +- net/ipv6/tcp_ipv6.c | 56 +------ 18 files changed, 38 insertions(+), 841 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 18a24c405ac0..17953e2bc3e9 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -175,14 +175,6 @@ tcp_congestion_control - STRING is inherited. [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] -tcp_cookie_size - INTEGER - Default size of TCP Cookie Transactions (TCPCT) option, that may be - overridden on a per socket basis by the TCPCT socket option. - Values greater than the maximum (16) are interpreted as the maximum. - Values greater than zero and less than the minimum (8) are interpreted - as the minimum. Odd values are interpreted as the next even value. - Default: 0 (off). - tcp_dsack - BOOLEAN Allows TCP to send "duplicate" SACKs.
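(The caller-visible shape of this removal, summarized from the hunks below: the hash_location/hvpp out-parameter simply disappears from tcp_parse_options().)

    /* before: an extra out-parameter reported the TCPCT cookie option */
    tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);

    /* after: TCPCT is gone, there is no cookie option to report */
    tcp_parse_options(skb, &tmp_opt, 0, NULL);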
diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c index 8dcc84fd9d30..54fd31fcc332 100644 --- a/drivers/infiniband/hw/cxgb4/cm.c +++ b/drivers/infiniband/hw/cxgb4/cm.c @@ -2915,7 +2915,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos) */ memset(&tmp_opt, 0, sizeof(tmp_opt)); tcp_clear_options(&tmp_opt); - tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL); + tcp_parse_options(skb, &tmp_opt, 0, NULL); req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req)); memset(req, 0, sizeof(*req)); diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 763c108ee03d..ed6a7456eecd 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -90,9 +90,6 @@ struct tcp_options_received { sack_ok : 4, /* SACK seen on SYN packet */ snd_wscale : 4, /* Window scaling received from sender */ rcv_wscale : 4; /* Window scaling to send to receiver */ - u8 cookie_plus:6, /* bytes in authenticator/cookie option */ - cookie_out_never:1, - cookie_in_always:1; u8 num_sacks; /* Number of SACK blocks */ u16 user_mss; /* mss requested by user in ioctl */ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ @@ -102,7 +99,6 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt) { rx_opt->tstamp_ok = rx_opt->sack_ok = 0; rx_opt->wscale_ok = rx_opt->snd_wscale = 0; - rx_opt->cookie_plus = 0; } /* This is the max number of SACKS that we'll generate and process. It's safe @@ -320,12 +316,6 @@ struct tcp_sock { struct tcp_md5sig_info __rcu *md5sig_info; #endif - /* When the cookie options are generated and exchanged, then this - * object holds a reference to them (cookie_values->kref). Also - * contains related tcp_cookie_transactions fields. - */ - struct tcp_cookie_values *cookie_values; - /* TCP fastopen related information */ struct tcp_fastopen_request *fastopen_req; /* fastopen_rsk points to request_sock that resulted in this big diff --git a/include/net/request_sock.h b/include/net/request_sock.h index a51dbd17c2de..9069e65c1c56 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -27,19 +27,13 @@ struct sk_buff; struct dst_entry; struct proto; -/* empty to "strongly type" an otherwise void parameter. - */ -struct request_values { -}; - struct request_sock_ops { int family; int obj_size; struct kmem_cache *slab; char *slab_name; int (*rtx_syn_ack)(struct sock *sk, - struct request_sock *req, - struct request_values *rvp); + struct request_sock *req); void (*send_ack)(struct sock *sk, struct sk_buff *skb, struct request_sock *req); void (*send_reset)(struct sock *sk, diff --git a/include/net/tcp.h b/include/net/tcp.h index ab9f947b118b..7f2f17198d75 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -179,7 +179,6 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo); #define TCPOPT_SACK 5 /* SACK Block */ #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */ #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */ -#define TCPOPT_COOKIE 253 /* Cookie extension (experimental) */ #define TCPOPT_EXP 254 /* Experimental */ /* Magic number to be after the option value for sharing TCP * experimental options. 
See draft-ietf-tcpm-experimental-options-00.txt @@ -454,7 +453,7 @@ extern void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req); extern int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len); extern void tcp_parse_options(const struct sk_buff *skb, - struct tcp_options_received *opt_rx, const u8 **hvpp, + struct tcp_options_received *opt_rx, int estab, struct tcp_fastopen_cookie *foc); extern const u8 *tcp_parse_md5sig_option(const struct tcphdr *th); @@ -476,7 +475,6 @@ extern int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, extern int tcp_connect(struct sock *sk); extern struct sk_buff * tcp_make_synack(struct sock *sk, struct dst_entry *dst, struct request_sock *req, - struct request_values *rvp, struct tcp_fastopen_cookie *foc); extern int tcp_disconnect(struct sock *sk, int flags); @@ -1589,91 +1587,6 @@ struct tcp_request_sock_ops { #endif }; -/* Using SHA1 for now, define some constants. - */ -#define COOKIE_DIGEST_WORDS (SHA_DIGEST_WORDS) -#define COOKIE_MESSAGE_WORDS (SHA_MESSAGE_BYTES / 4) -#define COOKIE_WORKSPACE_WORDS (COOKIE_DIGEST_WORDS + COOKIE_MESSAGE_WORDS) - -extern int tcp_cookie_generator(u32 *bakery); - -/** - * struct tcp_cookie_values - each socket needs extra space for the - * cookies, together with (optional) space for any SYN data. - * - * A tcp_sock contains a pointer to the current value, and this is - * cloned to the tcp_timewait_sock. - * - * @cookie_pair: variable data from the option exchange. - * - * @cookie_desired: user specified tcpct_cookie_desired. Zero - * indicates default (sysctl_tcp_cookie_size). - * After cookie sent, remembers size of cookie. - * Range 0, TCP_COOKIE_MIN to TCP_COOKIE_MAX. - * - * @s_data_desired: user specified tcpct_s_data_desired. When the - * constant payload is specified (@s_data_constant), - * holds its length instead. - * Range 0 to TCP_MSS_DESIRED. - * - * @s_data_payload: constant data that is to be included in the - * payload of SYN or SYNACK segments when the - * cookie option is present. - */ -struct tcp_cookie_values { - struct kref kref; - u8 cookie_pair[TCP_COOKIE_PAIR_SIZE]; - u8 cookie_pair_size; - u8 cookie_desired; - u16 s_data_desired:11, - s_data_constant:1, - s_data_in:1, - s_data_out:1, - s_data_unused:2; - u8 s_data_payload[0]; -}; - -static inline void tcp_cookie_values_release(struct kref *kref) -{ - kfree(container_of(kref, struct tcp_cookie_values, kref)); -} - -/* The length of constant payload data. Note that s_data_desired is - * overloaded, depending on s_data_constant: either the length of constant - * data (returned here) or the limit on variable data. - */ -static inline int tcp_s_data_size(const struct tcp_sock *tp) -{ - return (tp->cookie_values != NULL && tp->cookie_values->s_data_constant) - ? tp->cookie_values->s_data_desired - : 0; -} - -/** - * struct tcp_extend_values - tcp_ipv?.c to tcp_output.c workspace. - * - * As tcp_request_sock has already been extended in other places, the - * only remaining method is to pass stack values along as function - * parameters. These parameters are not needed after sending SYNACK. - * - * @cookie_bakery: cryptographic secret and message workspace. - * - * @cookie_plus: bytes in authenticator/cookie option, copied from - * struct tcp_options_received (above). 
- */ -struct tcp_extend_values { - struct request_values rv; - u32 cookie_bakery[COOKIE_WORKSPACE_WORDS]; - u8 cookie_plus:6, - cookie_out_never:1, - cookie_in_always:1; -}; - -static inline struct tcp_extend_values *tcp_xv(struct request_values *rvp) -{ - return (struct tcp_extend_values *)rvp; -} - extern void tcp_v4_init(void); extern void tcp_init(void); diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 6b1ead0b0c9d..8d776ebc4829 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -102,7 +102,6 @@ enum { #define TCP_QUICKACK 12 /* Block/reenable quick acks */ #define TCP_CONGESTION 13 /* Congestion control algorithm */ #define TCP_MD5SIG 14 /* TCP MD5 Signature (RFC2385) */ -#define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */ #define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin streams*/ #define TCP_THIN_DUPACK 17 /* Fast retrans. after 1 dupack */ #define TCP_USER_TIMEOUT 18 /* How long for loss retry before timeout */ @@ -199,29 +198,4 @@ struct tcp_md5sig { __u8 tcpm_key[TCP_MD5SIG_MAXKEYLEN]; /* key (binary) */ }; -/* for TCP_COOKIE_TRANSACTIONS (TCPCT) socket option */ -#define TCP_COOKIE_MIN 8 /* 64-bits */ -#define TCP_COOKIE_MAX 16 /* 128-bits */ -#define TCP_COOKIE_PAIR_SIZE (2*TCP_COOKIE_MAX) - -/* Flags for both getsockopt and setsockopt */ -#define TCP_COOKIE_IN_ALWAYS (1 << 0) /* Discard SYN without cookie */ -#define TCP_COOKIE_OUT_NEVER (1 << 1) /* Prohibit outgoing cookies, - * supercedes everything. */ - -/* Flags for getsockopt */ -#define TCP_S_DATA_IN (1 << 2) /* Was data received? */ -#define TCP_S_DATA_OUT (1 << 3) /* Was data sent? */ - -/* TCP_COOKIE_TRANSACTIONS data */ -struct tcp_cookie_transactions { - __u16 tcpct_flags; /* see above */ - __u8 __tcpct_pad1; /* zero */ - __u8 tcpct_cookie_desired; /* bytes */ - __u16 tcpct_s_data_desired; /* bytes of variable data */ - __u16 tcpct_used; /* bytes in value */ - __u8 tcpct_value[TCP_MSS_DEFAULT]; -}; - - #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c index 4f9f5eb478f1..ebc54fef85a5 100644 --- a/net/dccp/ipv4.c +++ b/net/dccp/ipv4.c @@ -500,8 +500,7 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk, return &rt->dst; } -static int dccp_v4_send_response(struct sock *sk, struct request_sock *req, - struct request_values *rv_unused) +static int dccp_v4_send_response(struct sock *sk, struct request_sock *req) { int err = -1; struct sk_buff *skb; @@ -658,7 +657,7 @@ int dccp_v4_conn_request(struct sock *sk, struct sk_buff *skb) dreq->dreq_gss = dreq->dreq_iss; dreq->dreq_service = service; - if (dccp_v4_send_response(sk, req, NULL)) + if (dccp_v4_send_response(sk, req)) goto drop_and_free; inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT); diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index 6e05981f271e..9c61f9c02fdb 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -213,8 +213,7 @@ out: } -static int dccp_v6_send_response(struct sock *sk, struct request_sock *req, - struct request_values *rv_unused) +static int dccp_v6_send_response(struct sock *sk, struct request_sock *req) { struct inet6_request_sock *ireq6 = inet6_rsk(req); struct ipv6_pinfo *np = inet6_sk(sk); @@ -428,7 +427,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb) dreq->dreq_gss = dreq->dreq_iss; dreq->dreq_service = service; - if (dccp_v6_send_response(sk, req, NULL)) + if (dccp_v6_send_response(sk, req)) goto drop_and_free; inet6_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT); diff 
--git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 786d97aee751..6acb541c9091 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -559,7 +559,7 @@ static inline void syn_ack_recalc(struct request_sock *req, const int thresh, int inet_rtx_syn_ack(struct sock *parent, struct request_sock *req) { - int err = req->rsk_ops->rtx_syn_ack(parent, req, NULL); + int err = req->rsk_ops->rtx_syn_ack(parent, req); if (!err) req->num_retrans++; diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index ef54377fb11c..7f4a5cb8f8d0 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -267,7 +267,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb, struct ip_options *opt) { struct tcp_options_received tcp_opt; - const u8 *hash_location; struct inet_request_sock *ireq; struct tcp_request_sock *treq; struct tcp_sock *tp = tcp_sk(sk); @@ -294,7 +293,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb, /* check for timestamp cookie support */ memset(&tcp_opt, 0, sizeof(tcp_opt)); - tcp_parse_options(skb, &tcp_opt, &hash_location, 0, NULL); + tcp_parse_options(skb, &tcp_opt, 0, NULL); if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok)) goto out; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index cca4550f4082..cb45062c8be0 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -732,13 +732,6 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - { - .procname = "tcp_cookie_size", - .data = &sysctl_tcp_cookie_size, - .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = proc_dointvec - }, { .procname = "tcp_thin_linear_timeouts", .data = &sysctl_tcp_thin_linear_timeouts, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8d14573ade77..17a6810af5c8 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -409,15 +409,6 @@ void tcp_init_sock(struct sock *sk) icsk->icsk_sync_mss = tcp_sync_mss; - /* TCP Cookie Transactions */ - if (sysctl_tcp_cookie_size > 0) { - /* Default, cookies without s_data_payload. */ - tp->cookie_values = - kzalloc(sizeof(*tp->cookie_values), - sk->sk_allocation); - if (tp->cookie_values != NULL) - kref_init(&tp->cookie_values->kref); - } /* Presumed zeroed, in order of appearance: * cookie_in_always, cookie_out_never, * s_data_constant, s_data_in, s_data_out @@ -2397,92 +2388,6 @@ static int do_tcp_setsockopt(struct sock *sk, int level, release_sock(sk); return err; } - case TCP_COOKIE_TRANSACTIONS: { - struct tcp_cookie_transactions ctd; - struct tcp_cookie_values *cvp = NULL; - - if (sizeof(ctd) > optlen) - return -EINVAL; - if (copy_from_user(&ctd, optval, sizeof(ctd))) - return -EFAULT; - - if (ctd.tcpct_used > sizeof(ctd.tcpct_value) || - ctd.tcpct_s_data_desired > TCP_MSS_DESIRED) - return -EINVAL; - - if (ctd.tcpct_cookie_desired == 0) { - /* default to global value */ - } else if ((0x1 & ctd.tcpct_cookie_desired) || - ctd.tcpct_cookie_desired > TCP_COOKIE_MAX || - ctd.tcpct_cookie_desired < TCP_COOKIE_MIN) { - return -EINVAL; - } - - if (TCP_COOKIE_OUT_NEVER & ctd.tcpct_flags) { - /* Supercedes all other values */ - lock_sock(sk); - if (tp->cookie_values != NULL) { - kref_put(&tp->cookie_values->kref, - tcp_cookie_values_release); - tp->cookie_values = NULL; - } - tp->rx_opt.cookie_in_always = 0; /* false */ - tp->rx_opt.cookie_out_never = 1; /* true */ - release_sock(sk); - return err; - } - - /* Allocate ancillary memory before locking. 
- */ - if (ctd.tcpct_used > 0 || - (tp->cookie_values == NULL && - (sysctl_tcp_cookie_size > 0 || - ctd.tcpct_cookie_desired > 0 || - ctd.tcpct_s_data_desired > 0))) { - cvp = kzalloc(sizeof(*cvp) + ctd.tcpct_used, - GFP_KERNEL); - if (cvp == NULL) - return -ENOMEM; - - kref_init(&cvp->kref); - } - lock_sock(sk); - tp->rx_opt.cookie_in_always = - (TCP_COOKIE_IN_ALWAYS & ctd.tcpct_flags); - tp->rx_opt.cookie_out_never = 0; /* false */ - - if (tp->cookie_values != NULL) { - if (cvp != NULL) { - /* Changed values are recorded by a changed - * pointer, ensuring the cookie will differ, - * without separately hashing each value later. - */ - kref_put(&tp->cookie_values->kref, - tcp_cookie_values_release); - } else { - cvp = tp->cookie_values; - } - } - - if (cvp != NULL) { - cvp->cookie_desired = ctd.tcpct_cookie_desired; - - if (ctd.tcpct_used > 0) { - memcpy(cvp->s_data_payload, ctd.tcpct_value, - ctd.tcpct_used); - cvp->s_data_desired = ctd.tcpct_used; - cvp->s_data_constant = 1; /* true */ - } else { - /* No constant payload data. */ - cvp->s_data_desired = ctd.tcpct_s_data_desired; - cvp->s_data_constant = 0; /* false */ - } - - tp->cookie_values = cvp; - } - release_sock(sk); - return err; - } default: /* fallthru */ break; @@ -2902,41 +2807,6 @@ static int do_tcp_getsockopt(struct sock *sk, int level, return -EFAULT; return 0; - case TCP_COOKIE_TRANSACTIONS: { - struct tcp_cookie_transactions ctd; - struct tcp_cookie_values *cvp = tp->cookie_values; - - if (get_user(len, optlen)) - return -EFAULT; - if (len < sizeof(ctd)) - return -EINVAL; - - memset(&ctd, 0, sizeof(ctd)); - ctd.tcpct_flags = (tp->rx_opt.cookie_in_always ? - TCP_COOKIE_IN_ALWAYS : 0) - | (tp->rx_opt.cookie_out_never ? - TCP_COOKIE_OUT_NEVER : 0); - - if (cvp != NULL) { - ctd.tcpct_flags |= (cvp->s_data_in ? - TCP_S_DATA_IN : 0) - | (cvp->s_data_out ? - TCP_S_DATA_OUT : 0); - - ctd.tcpct_cookie_desired = cvp->cookie_desired; - ctd.tcpct_s_data_desired = cvp->s_data_desired; - - memcpy(&ctd.tcpct_value[0], &cvp->cookie_pair[0], - cvp->cookie_pair_size); - ctd.tcpct_used = cvp->cookie_pair_size; - } - - if (put_user(sizeof(ctd), optlen)) - return -EFAULT; - if (copy_to_user(optval, &ctd, sizeof(ctd))) - return -EFAULT; - return 0; - } case TCP_THIN_LINEAR_TIMEOUTS: val = tp->thin_lto; break; @@ -3409,134 +3279,6 @@ EXPORT_SYMBOL(tcp_md5_hash_key); #endif -/* Each Responder maintains up to two secret values concurrently for - * efficient secret rollover. Each secret value has 4 states: - * - * Generating. (tcp_secret_generating != tcp_secret_primary) - * Generates new Responder-Cookies, but not yet used for primary - * verification. This is a short-term state, typically lasting only - * one round trip time (RTT). - * - * Primary. (tcp_secret_generating == tcp_secret_primary) - * Used both for generation and primary verification. - * - * Retiring. (tcp_secret_retiring != tcp_secret_secondary) - * Used for verification, until the first failure that can be - * verified by the newer Generating secret. At that time, this - * cookie's state is changed to Secondary, and the Generating - * cookie's state is changed to Primary. This is a short-term state, - * typically lasting only one round trip time (RTT). - * - * Secondary. (tcp_secret_retiring == tcp_secret_secondary) - * Used for secondary verification, after primary verification - * failures. This state lasts no more than twice the Maximum Segment - * Lifetime (2MSL). Then, the secret is discarded. - */ -struct tcp_cookie_secret { - /* The secret is divided into two parts. 
The digest part is the - * equivalent of previously hashing a secret and saving the state, - * and serves as an initialization vector (IV). The message part - * serves as the trailing secret. - */ - u32 secrets[COOKIE_WORKSPACE_WORDS]; - unsigned long expires; -}; - -#define TCP_SECRET_1MSL (HZ * TCP_PAWS_MSL) -#define TCP_SECRET_2MSL (HZ * TCP_PAWS_MSL * 2) -#define TCP_SECRET_LIFE (HZ * 600) - -static struct tcp_cookie_secret tcp_secret_one; -static struct tcp_cookie_secret tcp_secret_two; - -/* Essentially a circular list, without dynamic allocation. */ -static struct tcp_cookie_secret *tcp_secret_generating; -static struct tcp_cookie_secret *tcp_secret_primary; -static struct tcp_cookie_secret *tcp_secret_retiring; -static struct tcp_cookie_secret *tcp_secret_secondary; - -static DEFINE_SPINLOCK(tcp_secret_locker); - -/* Select a pseudo-random word in the cookie workspace. - */ -static inline u32 tcp_cookie_work(const u32 *ws, const int n) -{ - return ws[COOKIE_DIGEST_WORDS + ((COOKIE_MESSAGE_WORDS-1) & ws[n])]; -} - -/* Fill bakery[COOKIE_WORKSPACE_WORDS] with generator, updating as needed. - * Called in softirq context. - * Returns: 0 for success. - */ -int tcp_cookie_generator(u32 *bakery) -{ - unsigned long jiffy = jiffies; - - if (unlikely(time_after_eq(jiffy, tcp_secret_generating->expires))) { - spin_lock_bh(&tcp_secret_locker); - if (!time_after_eq(jiffy, tcp_secret_generating->expires)) { - /* refreshed by another */ - memcpy(bakery, - &tcp_secret_generating->secrets[0], - COOKIE_WORKSPACE_WORDS); - } else { - /* still needs refreshing */ - get_random_bytes(bakery, COOKIE_WORKSPACE_WORDS); - - /* The first time, paranoia assumes that the - * randomization function isn't as strong. But, - * this secret initialization is delayed until - * the last possible moment (packet arrival). - * Although that time is observable, it is - * unpredictably variable. Mash in the most - * volatile clock bits available, and expire the - * secret extra quickly. - */ - if (unlikely(tcp_secret_primary->expires == - tcp_secret_secondary->expires)) { - struct timespec tv; - - getnstimeofday(&tv); - bakery[COOKIE_DIGEST_WORDS+0] ^= - (u32)tv.tv_nsec; - - tcp_secret_secondary->expires = jiffy - + TCP_SECRET_1MSL - + (0x0f & tcp_cookie_work(bakery, 0)); - } else { - tcp_secret_secondary->expires = jiffy - + TCP_SECRET_LIFE - + (0xff & tcp_cookie_work(bakery, 1)); - tcp_secret_primary->expires = jiffy - + TCP_SECRET_2MSL - + (0x1f & tcp_cookie_work(bakery, 2)); - } - memcpy(&tcp_secret_secondary->secrets[0], - bakery, COOKIE_WORKSPACE_WORDS); - - rcu_assign_pointer(tcp_secret_generating, - tcp_secret_secondary); - rcu_assign_pointer(tcp_secret_retiring, - tcp_secret_primary); - /* - * Neither call_rcu() nor synchronize_rcu() needed. - * Retiring data is not freed. It is replaced after - * further (locked) pointer updates, and a quiet time - * (minimum 1MSL, maximum LIFE - 2MSL). 
- */ - } - spin_unlock_bh(&tcp_secret_locker); - } else { - rcu_read_lock_bh(); - memcpy(bakery, - &rcu_dereference(tcp_secret_generating)->secrets[0], - COOKIE_WORKSPACE_WORDS); - rcu_read_unlock_bh(); - } - return 0; -} -EXPORT_SYMBOL(tcp_cookie_generator); - void tcp_done(struct sock *sk) { struct request_sock *req = tcp_sk(sk)->fastopen_rsk; @@ -3591,7 +3333,6 @@ void __init tcp_init(void) unsigned long limit; int max_rshare, max_wshare, cnt; unsigned int i; - unsigned long jiffy = jiffies; BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb)); @@ -3667,13 +3408,5 @@ void __init tcp_init(void) tcp_register_congestion_control(&tcp_reno); - memset(&tcp_secret_one.secrets[0], 0, sizeof(tcp_secret_one.secrets)); - memset(&tcp_secret_two.secrets[0], 0, sizeof(tcp_secret_two.secrets)); - tcp_secret_one.expires = jiffy; /* past due */ - tcp_secret_two.expires = jiffy; /* past due */ - tcp_secret_generating = &tcp_secret_one; - tcp_secret_primary = &tcp_secret_one; - tcp_secret_retiring = &tcp_secret_two; - tcp_secret_secondary = &tcp_secret_two; tcp_tasklet_init(); } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 836d74dd0187..19f0149fb6a2 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3760,8 +3760,8 @@ old_ack: * But, this can also be called on packets in the established flow when * the fast version below fails. */ -void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *opt_rx, - const u8 **hvpp, int estab, +void tcp_parse_options(const struct sk_buff *skb, + struct tcp_options_received *opt_rx, int estab, struct tcp_fastopen_cookie *foc) { const unsigned char *ptr; @@ -3845,31 +3845,6 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o */ break; #endif - case TCPOPT_COOKIE: - /* This option is variable length. - */ - switch (opsize) { - case TCPOLEN_COOKIE_BASE: - /* not yet implemented */ - break; - case TCPOLEN_COOKIE_PAIR: - /* not yet implemented */ - break; - case TCPOLEN_COOKIE_MIN+0: - case TCPOLEN_COOKIE_MIN+2: - case TCPOLEN_COOKIE_MIN+4: - case TCPOLEN_COOKIE_MIN+6: - case TCPOLEN_COOKIE_MAX: - /* 16-bit multiple */ - opt_rx->cookie_plus = opsize; - *hvpp = ptr; - break; - default: - /* ignore option */ - break; - } - break; - case TCPOPT_EXP: /* Fast Open option shares code 254 using a * 16 bits magic number. It's valid only in @@ -3915,8 +3890,7 @@ static bool tcp_parse_aligned_timestamp(struct tcp_sock *tp, const struct tcphdr * If it is wrong it falls back on tcp_parse_options(). */ static bool tcp_fast_parse_options(const struct sk_buff *skb, - const struct tcphdr *th, - struct tcp_sock *tp, const u8 **hvpp) + const struct tcphdr *th, struct tcp_sock *tp) { /* In the spirit of fast parsing, compare doff directly to constant * values. Because equality is used, short doff can be ignored here. @@ -3930,7 +3904,7 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb, return true; } - tcp_parse_options(skb, &tp->rx_opt, hvpp, 1, NULL); + tcp_parse_options(skb, &tp->rx_opt, 1, NULL); if (tp->rx_opt.saw_tstamp) tp->rx_opt.rcv_tsecr -= tp->tsoffset; @@ -5311,12 +5285,10 @@ out: static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, const struct tcphdr *th, int syn_inerr) { - const u8 *hash_location; struct tcp_sock *tp = tcp_sk(sk); /* RFC1323: H1. Apply PAWS check first. 
*/ - if (tcp_fast_parse_options(skb, th, tp, &hash_location) && - tp->rx_opt.saw_tstamp && + if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp && tcp_paws_discard(sk, skb)) { if (!th->rst) { NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED); @@ -5670,12 +5642,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack, if (mss == tp->rx_opt.user_mss) { struct tcp_options_received opt; - const u8 *hash_location; /* Get original SYNACK MSS value if user MSS sets mss_clamp */ tcp_clear_options(&opt); opt.user_mss = opt.mss_clamp = 0; - tcp_parse_options(synack, &opt, &hash_location, 0, NULL); + tcp_parse_options(synack, &opt, 0, NULL); mss = opt.mss_clamp; } @@ -5706,14 +5677,12 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack, static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, const struct tcphdr *th, unsigned int len) { - const u8 *hash_location; struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); - struct tcp_cookie_values *cvp = tp->cookie_values; struct tcp_fastopen_cookie foc = { .len = -1 }; int saved_clamp = tp->rx_opt.mss_clamp; - tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0, &foc); + tcp_parse_options(skb, &tp->rx_opt, 0, &foc); if (tp->rx_opt.saw_tstamp) tp->rx_opt.rcv_tsecr -= tp->tsoffset; @@ -5810,30 +5779,6 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, * is initialized. */ tp->copied_seq = tp->rcv_nxt; - if (cvp != NULL && - cvp->cookie_pair_size > 0 && - tp->rx_opt.cookie_plus > 0) { - int cookie_size = tp->rx_opt.cookie_plus - - TCPOLEN_COOKIE_BASE; - int cookie_pair_size = cookie_size - + cvp->cookie_desired; - - /* A cookie extension option was sent and returned. - * Note that each incoming SYNACK replaces the - * Responder cookie. The initial exchange is most - * fragile, as protection against spoofing relies - * entirely upon the sequence and timestamp (above). - * This replacement strategy allows the correct pair to - * pass through, while any others will be filtered via - * Responder verification later. 
- */ - if (sizeof(cvp->cookie_pair) >= cookie_pair_size) { - memcpy(&cvp->cookie_pair[cvp->cookie_desired], - hash_location, cookie_size); - cvp->cookie_pair_size = cookie_pair_size; - } - } - smp_mb(); tcp_finish_connect(sk, skb); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index b7ab868c8284..b27c758ca23f 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -838,7 +838,6 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb, */ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst, struct request_sock *req, - struct request_values *rvp, u16 queue_mapping, bool nocache) { @@ -851,7 +850,7 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst, if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL) return -1; - skb = tcp_make_synack(sk, dst, req, rvp, NULL); + skb = tcp_make_synack(sk, dst, req, NULL); if (skb) { __tcp_v4_send_check(skb, ireq->loc_addr, ireq->rmt_addr); @@ -868,10 +867,9 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst, return err; } -static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req, - struct request_values *rvp) +static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req) { - int res = tcp_v4_send_synack(sk, NULL, req, rvp, 0, false); + int res = tcp_v4_send_synack(sk, NULL, req, 0, false); if (!res) TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS); @@ -1371,8 +1369,7 @@ static bool tcp_fastopen_check(struct sock *sk, struct sk_buff *skb, static int tcp_v4_conn_req_fastopen(struct sock *sk, struct sk_buff *skb, struct sk_buff *skb_synack, - struct request_sock *req, - struct request_values *rvp) + struct request_sock *req) { struct tcp_sock *tp = tcp_sk(sk); struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue; @@ -1467,9 +1464,7 @@ static int tcp_v4_conn_req_fastopen(struct sock *sk, int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) { - struct tcp_extend_values tmp_ext; struct tcp_options_received tmp_opt; - const u8 *hash_location; struct request_sock *req; struct inet_request_sock *ireq; struct tcp_sock *tp = tcp_sk(sk); @@ -1519,42 +1514,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) tcp_clear_options(&tmp_opt); tmp_opt.mss_clamp = TCP_MSS_DEFAULT; tmp_opt.user_mss = tp->rx_opt.user_mss; - tcp_parse_options(skb, &tmp_opt, &hash_location, 0, - want_cookie ? NULL : &foc); - - if (tmp_opt.cookie_plus > 0 && - tmp_opt.saw_tstamp && - !tp->rx_opt.cookie_out_never && - (sysctl_tcp_cookie_size > 0 || - (tp->cookie_values != NULL && - tp->cookie_values->cookie_desired > 0))) { - u8 *c; - u32 *mess = &tmp_ext.cookie_bakery[COOKIE_DIGEST_WORDS]; - int l = tmp_opt.cookie_plus - TCPOLEN_COOKIE_BASE; - - if (tcp_cookie_generator(&tmp_ext.cookie_bakery[0]) != 0) - goto drop_and_release; - - /* Secret recipe starts with IP addresses */ - *mess++ ^= (__force u32)daddr; - *mess++ ^= (__force u32)saddr; - - /* plus variable length Initiator Cookie */ - c = (u8 *)mess; - while (l-- > 0) - *c++ ^= *hash_location++; - - want_cookie = false; /* not our kind of cookie */ - tmp_ext.cookie_out_never = 0; /* false */ - tmp_ext.cookie_plus = tmp_opt.cookie_plus; - } else if (!tp->rx_opt.cookie_in_always) { - /* redundant indications, but ensure initialization. */ - tmp_ext.cookie_out_never = 1; /* true */ - tmp_ext.cookie_plus = 0; - } else { - goto drop_and_release; - } - tmp_ext.cookie_in_always = tp->rx_opt.cookie_in_always; + tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? 
NULL : &foc); if (want_cookie && !tmp_opt.saw_tstamp) tcp_clear_options(&tmp_opt); @@ -1636,7 +1596,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) * of tcp_v4_send_synack()->tcp_select_initial_window(). */ skb_synack = tcp_make_synack(sk, dst, req, - (struct request_values *)&tmp_ext, fastopen_cookie_present(&valid_foc) ? &valid_foc : NULL); if (skb_synack) { @@ -1660,8 +1619,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) if (fastopen_cookie_present(&foc) && foc.len != 0) NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPFASTOPENPASSIVEFAIL); - } else if (tcp_v4_conn_req_fastopen(sk, skb, skb_synack, req, - (struct request_values *)&tmp_ext)) + } else if (tcp_v4_conn_req_fastopen(sk, skb, skb_synack, req)) goto drop_and_free; return 0; @@ -2241,12 +2199,6 @@ void tcp_v4_destroy_sock(struct sock *sk) if (inet_csk(sk)->icsk_bind_hash) inet_put_port(sk); - /* TCP Cookie Transactions */ - if (tp->cookie_values != NULL) { - kref_put(&tp->cookie_values->kref, - tcp_cookie_values_release); - tp->cookie_values = NULL; - } BUG_ON(tp->fastopen_rsk != NULL); /* If socket is aborted during connect operation */ diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 4bdb09fca401..8f0234f8bb95 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -93,13 +93,12 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb, const struct tcphdr *th) { struct tcp_options_received tmp_opt; - const u8 *hash_location; struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw); bool paws_reject = false; tmp_opt.saw_tstamp = 0; if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) { - tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL); + tcp_parse_options(skb, &tmp_opt, 0, NULL); if (tmp_opt.saw_tstamp) { tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset; @@ -388,32 +387,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, struct tcp_request_sock *treq = tcp_rsk(req); struct inet_connection_sock *newicsk = inet_csk(newsk); struct tcp_sock *newtp = tcp_sk(newsk); - struct tcp_sock *oldtp = tcp_sk(sk); - struct tcp_cookie_values *oldcvp = oldtp->cookie_values; - - /* TCP Cookie Transactions require space for the cookie pair, - * as it differs for each connection. There is no need to - * copy any s_data_payload stored at the original socket. - * Failure will prevent resuming the connection. 
- * - * Presumed copied, in order of appearance: - * cookie_in_always, cookie_out_never - */ - if (oldcvp != NULL) { - struct tcp_cookie_values *newcvp = - kzalloc(sizeof(*newtp->cookie_values), - GFP_ATOMIC); - - if (newcvp != NULL) { - kref_init(&newcvp->kref); - newcvp->cookie_desired = - oldcvp->cookie_desired; - newtp->cookie_values = newcvp; - } else { - /* Not Yet Implemented */ - newtp->cookie_values = NULL; - } - } /* Now setup tcp_sock */ newtp->pred_flags = 0; @@ -422,8 +395,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, newtp->rcv_nxt = treq->rcv_isn + 1; newtp->snd_sml = newtp->snd_una = - newtp->snd_nxt = newtp->snd_up = - treq->snt_isn + 1 + tcp_s_data_size(oldtp); + newtp->snd_nxt = newtp->snd_up = treq->snt_isn + 1; tcp_prequeue_init(newtp); INIT_LIST_HEAD(&newtp->tsq_node); @@ -460,8 +432,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, tcp_set_ca_state(newsk, TCP_CA_Open); tcp_init_xmit_timers(newsk); skb_queue_head_init(&newtp->out_of_order_queue); - newtp->write_seq = newtp->pushed_seq = - treq->snt_isn + 1 + tcp_s_data_size(oldtp); + newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1; newtp->rx_opt.saw_tstamp = 0; @@ -538,7 +509,6 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, bool fastopen) { struct tcp_options_received tmp_opt; - const u8 *hash_location; struct sock *child; const struct tcphdr *th = tcp_hdr(skb); __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK); @@ -548,7 +518,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, tmp_opt.saw_tstamp = 0; if (th->doff > (sizeof(struct tcphdr)>>2)) { - tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL); + tcp_parse_options(skb, &tmp_opt, 0, NULL); if (tmp_opt.saw_tstamp) { tmp_opt.ts_recent = req->ts_recent; @@ -648,7 +618,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, */ if ((flg & TCP_FLAG_ACK) && !fastopen && (TCP_SKB_CB(skb)->ack_seq != - tcp_rsk(req)->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk)))) + tcp_rsk(req)->snt_isn + 1)) return sk; /* Also, it would be not so bad idea to check rcv_tsecr, which diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 8e7742f0b5d2..ac5871ebe086 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -65,9 +65,6 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS; /* By default, RFC2861 behavior. */ int sysctl_tcp_slow_start_after_idle __read_mostly = 1; -int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */ -EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size); - static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, int push_one, gfp_t gfp); @@ -386,7 +383,6 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp) #define OPTION_TS (1 << 1) #define OPTION_MD5 (1 << 2) #define OPTION_WSCALE (1 << 3) -#define OPTION_COOKIE_EXTENSION (1 << 4) #define OPTION_FAST_OPEN_COOKIE (1 << 8) struct tcp_out_options { @@ -400,36 +396,6 @@ struct tcp_out_options { struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */ }; -/* The sysctl int routines are generic, so check consistency here. 
- */ -static u8 tcp_cookie_size_check(u8 desired) -{ - int cookie_size; - - if (desired > 0) - /* previously specified */ - return desired; - - cookie_size = ACCESS_ONCE(sysctl_tcp_cookie_size); - if (cookie_size <= 0) - /* no default specified */ - return 0; - - if (cookie_size <= TCP_COOKIE_MIN) - /* value too small, specify minimum */ - return TCP_COOKIE_MIN; - - if (cookie_size >= TCP_COOKIE_MAX) - /* value too large, specify maximum */ - return TCP_COOKIE_MAX; - - if (cookie_size & 1) - /* 8-bit multiple, illegal, fix it */ - cookie_size++; - - return (u8)cookie_size; -} - /* Write previously computed TCP options to the packet. * * Beware: Something in the Internet is very sensitive to the ordering of @@ -448,27 +414,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp, { u16 options = opts->options; /* mungable copy */ - /* Having both authentication and cookies for security is redundant, - * and there's certainly not enough room. Instead, the cookie-less - * extension variant is proposed. - * - * Consider the pessimal case with authentication. The options - * could look like: - * COOKIE|MD5(20) + MSS(4) + SACK|TS(12) + WSCALE(4) == 40 - */ if (unlikely(OPTION_MD5 & options)) { - if (unlikely(OPTION_COOKIE_EXTENSION & options)) { - *ptr++ = htonl((TCPOPT_COOKIE << 24) | - (TCPOLEN_COOKIE_BASE << 16) | - (TCPOPT_MD5SIG << 8) | - TCPOLEN_MD5SIG); - } else { - *ptr++ = htonl((TCPOPT_NOP << 24) | - (TCPOPT_NOP << 16) | - (TCPOPT_MD5SIG << 8) | - TCPOLEN_MD5SIG); - } - options &= ~OPTION_COOKIE_EXTENSION; + *ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) | + (TCPOPT_MD5SIG << 8) | TCPOLEN_MD5SIG); /* overload cookie hash location */ opts->hash_location = (__u8 *)ptr; ptr += 4; @@ -497,44 +445,6 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp, *ptr++ = htonl(opts->tsecr); } - /* Specification requires after timestamp, so do it now. - * - * Consider the pessimal case without authentication. The options - * could look like: - * MSS(4) + SACK|TS(12) + COOKIE(20) + WSCALE(4) == 40 - */ - if (unlikely(OPTION_COOKIE_EXTENSION & options)) { - __u8 *cookie_copy = opts->hash_location; - u8 cookie_size = opts->hash_size; - - /* 8-bit multiple handled in tcp_cookie_size_check() above, - * and elsewhere. - */ - if (0x2 & cookie_size) { - __u8 *p = (__u8 *)ptr; - - /* 16-bit multiple */ - *p++ = TCPOPT_COOKIE; - *p++ = TCPOLEN_COOKIE_BASE + cookie_size; - *p++ = *cookie_copy++; - *p++ = *cookie_copy++; - ptr++; - cookie_size -= 2; - } else { - /* 32-bit multiple */ - *ptr++ = htonl(((TCPOPT_NOP << 24) | - (TCPOPT_NOP << 16) | - (TCPOPT_COOKIE << 8) | - TCPOLEN_COOKIE_BASE) + - cookie_size); - } - - if (cookie_size > 0) { - memcpy(ptr, cookie_copy, cookie_size); - ptr += (cookie_size / 4); - } - } - if (unlikely(OPTION_SACK_ADVERTISE & options)) { *ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) | @@ -593,11 +503,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, struct tcp_md5sig_key **md5) { struct tcp_sock *tp = tcp_sk(sk); - struct tcp_cookie_values *cvp = tp->cookie_values; unsigned int remaining = MAX_TCP_OPTION_SPACE; - u8 cookie_size = (!tp->rx_opt.cookie_out_never && cvp != NULL) ? - tcp_cookie_size_check(cvp->cookie_desired) : - 0; struct tcp_fastopen_request *fastopen = tp->fastopen_req; #ifdef CONFIG_TCP_MD5SIG @@ -649,52 +555,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, tp->syn_fastopen = 1; } } - /* Note that timestamps are required by the specification. 
- * - * Odd numbers of bytes are prohibited by the specification, ensuring - * that the cookie is 16-bit aligned, and the resulting cookie pair is - * 32-bit aligned. - */ - if (*md5 == NULL && - (OPTION_TS & opts->options) && - cookie_size > 0) { - int need = TCPOLEN_COOKIE_BASE + cookie_size; - - if (0x2 & need) { - /* 32-bit multiple */ - need += 2; /* NOPs */ - - if (need > remaining) { - /* try shrinking cookie to fit */ - cookie_size -= 2; - need -= 4; - } - } - while (need > remaining && TCP_COOKIE_MIN <= cookie_size) { - cookie_size -= 4; - need -= 4; - } - if (TCP_COOKIE_MIN <= cookie_size) { - opts->options |= OPTION_COOKIE_EXTENSION; - opts->hash_location = (__u8 *)&cvp->cookie_pair[0]; - opts->hash_size = cookie_size; - - /* Remember for future incarnations. */ - cvp->cookie_desired = cookie_size; - - if (cvp->cookie_desired != cvp->cookie_pair_size) { - /* Currently use random bytes as a nonce, - * assuming these are completely unpredictable - * by hostile users of the same system. - */ - get_random_bytes(&cvp->cookie_pair[0], - cookie_size); - cvp->cookie_pair_size = cookie_size; - } - remaining -= need; - } - } return MAX_TCP_OPTION_SPACE - remaining; } @@ -704,14 +565,10 @@ static unsigned int tcp_synack_options(struct sock *sk, unsigned int mss, struct sk_buff *skb, struct tcp_out_options *opts, struct tcp_md5sig_key **md5, - struct tcp_extend_values *xvp, struct tcp_fastopen_cookie *foc) { struct inet_request_sock *ireq = inet_rsk(req); unsigned int remaining = MAX_TCP_OPTION_SPACE; - u8 cookie_plus = (xvp != NULL && !xvp->cookie_out_never) ? - xvp->cookie_plus : - 0; #ifdef CONFIG_TCP_MD5SIG *md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req); @@ -759,28 +616,7 @@ static unsigned int tcp_synack_options(struct sock *sk, remaining -= need; } } - /* Similar rationale to tcp_syn_options() applies here, too. - * If the options fit, the same options should fit now! - */ - if (*md5 == NULL && - ireq->tstamp_ok && - cookie_plus > TCPOLEN_COOKIE_BASE) { - int need = cookie_plus; /* has TCPOLEN_COOKIE_BASE */ - - if (0x2 & need) { - /* 32-bit multiple */ - need += 2; /* NOPs */ - } - if (need <= remaining) { - opts->options |= OPTION_COOKIE_EXTENSION; - opts->hash_size = cookie_plus - TCPOLEN_COOKIE_BASE; - remaining -= need; - } else { - /* There's no error return, so flag it. */ - xvp->cookie_out_never = 1; /* true */ - opts->hash_size = 0; - } - } + return MAX_TCP_OPTION_SPACE - remaining; } @@ -2802,32 +2638,24 @@ int tcp_send_synack(struct sock *sk) * sk: listener socket * dst: dst entry attached to the SYNACK * req: request_sock pointer - * rvp: request_values pointer * * Allocate one skb and build a SYNACK packet. * @dst is consumed : Caller should not use it again. 
*/ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst, struct request_sock *req, - struct request_values *rvp, struct tcp_fastopen_cookie *foc) { struct tcp_out_options opts; - struct tcp_extend_values *xvp = tcp_xv(rvp); struct inet_request_sock *ireq = inet_rsk(req); struct tcp_sock *tp = tcp_sk(sk); - const struct tcp_cookie_values *cvp = tp->cookie_values; struct tcphdr *th; struct sk_buff *skb; struct tcp_md5sig_key *md5; int tcp_header_size; int mss; - int s_data_desired = 0; - if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired) - s_data_desired = cvp->s_data_desired; - skb = alloc_skb(MAX_TCP_HEADER + 15 + s_data_desired, - sk_gfp_atomic(sk, GFP_ATOMIC)); + skb = alloc_skb(MAX_TCP_HEADER + 15, sk_gfp_atomic(sk, GFP_ATOMIC)); if (unlikely(!skb)) { dst_release(dst); return NULL; @@ -2869,9 +2697,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst, else #endif TCP_SKB_CB(skb)->when = tcp_time_stamp; - tcp_header_size = tcp_synack_options(sk, req, mss, - skb, &opts, &md5, xvp, foc) - + sizeof(*th); + tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, &md5, + foc) + sizeof(*th); skb_push(skb, tcp_header_size); skb_reset_transport_header(skb); @@ -2889,40 +2716,6 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst, tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn, TCPHDR_SYN | TCPHDR_ACK); - if (OPTION_COOKIE_EXTENSION & opts.options) { - if (s_data_desired) { - u8 *buf = skb_put(skb, s_data_desired); - - /* copy data directly from the listening socket. */ - memcpy(buf, cvp->s_data_payload, s_data_desired); - TCP_SKB_CB(skb)->end_seq += s_data_desired; - } - - if (opts.hash_size > 0) { - __u32 workspace[SHA_WORKSPACE_WORDS]; - u32 *mess = &xvp->cookie_bakery[COOKIE_DIGEST_WORDS]; - u32 *tail = &mess[COOKIE_MESSAGE_WORDS-1]; - - /* Secret recipe depends on the Timestamp, (future) - * Sequence and Acknowledgment Numbers, Initiator - * Cookie, and others handled by IP variant caller. - */ - *tail-- ^= opts.tsval; - *tail-- ^= tcp_rsk(req)->rcv_isn + 1; - *tail-- ^= TCP_SKB_CB(skb)->seq + 1; - - /* recommended */ - *tail-- ^= (((__force u32)th->dest << 16) | (__force u32)th->source); - *tail-- ^= (u32)(unsigned long)cvp; /* per sockopt */ - - sha_transform((__u32 *)&xvp->cookie_bakery[0], - (char *)mess, - &workspace[0]); - opts.hash_location = - (__u8 *)&xvp->cookie_bakery[0]; - } - } - th->seq = htonl(TCP_SKB_CB(skb)->seq); /* XXX data is queued and acked as is. 
No buffer/window check */ th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt); diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c index 8a0848b60b35..d5dda20bd717 100644 --- a/net/ipv6/syncookies.c +++ b/net/ipv6/syncookies.c @@ -149,7 +149,6 @@ static inline int cookie_check(const struct sk_buff *skb, __u32 cookie) struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb) { struct tcp_options_received tcp_opt; - const u8 *hash_location; struct inet_request_sock *ireq; struct inet6_request_sock *ireq6; struct tcp_request_sock *treq; @@ -177,7 +176,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb) /* check for timestamp cookie support */ memset(&tcp_opt, 0, sizeof(tcp_opt)); - tcp_parse_options(skb, &tcp_opt, &hash_location, 0, NULL); + tcp_parse_options(skb, &tcp_opt, 0, NULL); if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok)) goto out; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 9b6460055df5..0a97add2ab74 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -454,7 +454,6 @@ out: static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst, struct flowi6 *fl6, struct request_sock *req, - struct request_values *rvp, u16 queue_mapping) { struct inet6_request_sock *treq = inet6_rsk(req); @@ -466,7 +465,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst, if (!dst && (dst = inet6_csk_route_req(sk, fl6, req)) == NULL) goto done; - skb = tcp_make_synack(sk, dst, req, rvp, NULL); + skb = tcp_make_synack(sk, dst, req, NULL); if (skb) { __tcp_v6_send_check(skb, &treq->loc_addr, &treq->rmt_addr); @@ -481,13 +480,12 @@ done: return err; } -static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req, - struct request_values *rvp) +static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req) { struct flowi6 fl6; int res; - res = tcp_v6_send_synack(sk, NULL, &fl6, req, rvp, 0); + res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0); if (!res) TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS); return res; @@ -940,9 +938,7 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb) */ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb) { - struct tcp_extend_values tmp_ext; struct tcp_options_received tmp_opt; - const u8 *hash_location; struct request_sock *req; struct inet6_request_sock *treq; struct ipv6_pinfo *np = inet6_sk(sk); @@ -980,50 +976,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb) tcp_clear_options(&tmp_opt); tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr); tmp_opt.user_mss = tp->rx_opt.user_mss; - tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL); - - if (tmp_opt.cookie_plus > 0 && - tmp_opt.saw_tstamp && - !tp->rx_opt.cookie_out_never && - (sysctl_tcp_cookie_size > 0 || - (tp->cookie_values != NULL && - tp->cookie_values->cookie_desired > 0))) { - u8 *c; - u32 *d; - u32 *mess = &tmp_ext.cookie_bakery[COOKIE_DIGEST_WORDS]; - int l = tmp_opt.cookie_plus - TCPOLEN_COOKIE_BASE; - - if (tcp_cookie_generator(&tmp_ext.cookie_bakery[0]) != 0) - goto drop_and_free; - - /* Secret recipe starts with IP addresses */ - d = (__force u32 *)&ipv6_hdr(skb)->daddr.s6_addr32[0]; - *mess++ ^= *d++; - *mess++ ^= *d++; - *mess++ ^= *d++; - *mess++ ^= *d++; - d = (__force u32 *)&ipv6_hdr(skb)->saddr.s6_addr32[0]; - *mess++ ^= *d++; - *mess++ ^= *d++; - *mess++ ^= *d++; - *mess++ ^= *d++; - - /* plus variable length Initiator Cookie */ - c = (u8 *)mess; - while (l-- > 0) - *c++ ^= *hash_location++; - 
- want_cookie = false; /* not our kind of cookie */ - tmp_ext.cookie_out_never = 0; /* false */ - tmp_ext.cookie_plus = tmp_opt.cookie_plus; - } else if (!tp->rx_opt.cookie_in_always) { - /* redundant indications, but ensure initialization. */ - tmp_ext.cookie_out_never = 1; /* true */ - tmp_ext.cookie_plus = 0; - } else { - goto drop_and_free; - } - tmp_ext.cookie_in_always = tp->rx_opt.cookie_in_always; + tcp_parse_options(skb, &tmp_opt, 0, NULL); if (want_cookie && !tmp_opt.saw_tstamp) tcp_clear_options(&tmp_opt); @@ -1101,7 +1054,6 @@ have_isn: goto drop_and_release; if (tcp_v6_send_synack(sk, dst, &fl6, req, - (struct request_values *)&tmp_ext, skb_get_queue_mapping(skb)) || want_cookie) goto drop_and_free; -- cgit v1.2.3 From 9b44190dc114c1720b34975b5bfc65aece112ced Mon Sep 17 00:00:00 2001 From: Yuchung Cheng Date: Wed, 20 Mar 2013 13:32:58 +0000 Subject: tcp: refactor F-RTO The patch series refactors the F-RTO feature (RFC4138/5682). This is to simplify the loss recovery processing. The existing F-RTO was developed during the experimental stage (RFC4138) and has many experimental features. It takes a separate code path from the traditional timeout processing by overloading CA_Disorder instead of using the CA_Loss state. This complicates CA_Disorder state handling because it's also used for handling dubious ACKs and undos. While the algorithm in the RFC does not change the congestion control, the implementation intercepts congestion control in various places (e.g., frto_cwnd in tcp_ack()). The new code implements the newer F-RTO RFC5682 using the CA_Loss processing path. F-RTO becomes a small extension in the timeout processing and interfaces with congestion control and Eifel undo modules. It lets the congestion control (module) determine how many packets to send independently. F-RTO only chooses what to send in order to detect spurious retransmission. If the timeout is found spurious, it invokes existing Eifel undo algorithms like DSACK or TCP timestamp-based detection. The first patch removes all F-RTO code except sysctl_tcp_frto, which is left for the new implementation. Since CA_EVENT_FRTO is removed, TCP Westwood now computes ssthresh on the regular timeout CA_EVENT_LOSS event. Signed-off-by: Yuchung Cheng Acked-by: Neal Cardwell Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 17 -- include/linux/tcp.h | 6 +- include/net/tcp.h | 4 - net/ipv4/sysctl_net_ipv4.c | 7 - net/ipv4/tcp_input.c | 375 +-------------------------- net/ipv4/tcp_minisocks.c | 3 - net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_timer.c | 6 +- net/ipv4/tcp_westwood.c | 2 +- 9 files changed, 10 insertions(+), 421 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 17953e2bc3e9..8a977a0aaede 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -239,23 +239,6 @@ tcp_frto - INTEGER interacts badly with the packet counting of the SACK enabled TCP flow. -tcp_frto_response - INTEGER - When F-RTO has detected that a TCP retransmission timeout was - spurious (i.e, the timeout would have been avoided had TCP set a - longer retransmission timeout), TCP has several options what to do - next.
Possible values are: - 0 Rate halving based; a smooth and conservative response, - results in halved cwnd and ssthresh after one RTT - 1 Very conservative response; not recommended because even - though being valid, it interacts poorly with the rest of - Linux TCP, halves cwnd and ssthresh immediately - 2 Aggressive response; undoes congestion control measures - that are now known to be unnecessary (ignoring the - possibility of a lost retransmission that would require - TCP to be more cautious), cwnd and ssthresh are restored - to the values prior timeout - Default: 0 (rate halving based) - tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. Default: 2hours. diff --git a/include/linux/tcp.h b/include/linux/tcp.h index ed6a7456eecd..f5f203b36379 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -187,14 +187,12 @@ struct tcp_sock { u32 window_clamp; /* Maximal window to advertise */ u32 rcv_ssthresh; /* Current window clamp */ - u32 frto_highmark; /* snd_nxt when RTO occurred */ u16 advmss; /* Advertised MSS */ - u8 frto_counter; /* Number of new acks after RTO */ + u8 unused; u8 nonagle : 4,/* Disable Nagle algorithm? */ thin_lto : 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ - repair : 1, - unused : 1; + repair : 1; u8 repair_queue; u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */ syn_data:1, /* SYN includes data */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 7f2f17198d75..d1dcb596230e 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,7 +272,6 @@ extern int sysctl_tcp_app_win; extern int sysctl_tcp_adv_win_scale; extern int sysctl_tcp_tw_reuse; extern int sysctl_tcp_frto; -extern int sysctl_tcp_frto_response; extern int sysctl_tcp_low_latency; extern int sysctl_tcp_dma_copybreak; extern int sysctl_tcp_nometrics_save; @@ -424,8 +423,6 @@ extern struct sock * tcp_check_req(struct sock *sk,struct sk_buff *skb, bool fastopen); extern int tcp_child_process(struct sock *parent, struct sock *child, struct sk_buff *skb); -extern bool tcp_use_frto(struct sock *sk); -extern void tcp_enter_frto(struct sock *sk); extern void tcp_enter_loss(struct sock *sk, int how); extern void tcp_clear_retrans(struct tcp_sock *tp); extern void tcp_update_metrics(struct sock *sk); @@ -756,7 +753,6 @@ enum tcp_ca_event { CA_EVENT_TX_START, /* first transmit when no packets in flight */ CA_EVENT_CWND_RESTART, /* congestion window restart */ CA_EVENT_COMPLETE_CWR, /* end of congestion recovery */ - CA_EVENT_FRTO, /* fast recovery timeout */ CA_EVENT_LOSS, /* loss timeout */ CA_EVENT_FAST_ACK, /* in sequence ack */ CA_EVENT_SLOW_ACK, /* other ack */ diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index cb45062c8be0..fa2f63fc453b 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -591,13 +591,6 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, - { - .procname = "tcp_frto_response", - .data = &sysctl_tcp_frto_response, - .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = proc_dointvec - }, { .procname = "tcp_low_latency", .data = &sysctl_tcp_low_latency, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 19f0149fb6a2..231c79fe91f3 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -93,7 +93,6 @@ int sysctl_tcp_stdurg __read_mostly; int sysctl_tcp_rfc1337 __read_mostly; int sysctl_tcp_max_orphans __read_mostly = NR_FILE; int sysctl_tcp_frto __read_mostly = 2; 
-int sysctl_tcp_frto_response __read_mostly; int sysctl_tcp_thin_dupack __read_mostly; @@ -108,17 +107,14 @@ int sysctl_tcp_early_retrans __read_mostly = 3; #define FLAG_DATA_SACKED 0x20 /* New SACK. */ #define FLAG_ECE 0x40 /* ECE in this ACK */ #define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/ -#define FLAG_ONLY_ORIG_SACKED 0x200 /* SACKs only non-rexmit sent before RTO */ #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */ #define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */ -#define FLAG_NONHEAD_RETRANS_ACKED 0x1000 /* Non-head rexmitted data was ACKed */ #define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */ #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) #define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) #define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE) #define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED) -#define FLAG_ANY_PROGRESS (FLAG_FORWARD_PROGRESS|FLAG_SND_UNA_ADVANCED) #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH) #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH)) @@ -1159,10 +1155,6 @@ static u8 tcp_sacktag_one(struct sock *sk, tcp_highest_sack_seq(tp))) state->reord = min(fack_count, state->reord); - - /* SACK enhanced F-RTO (RFC4138; Appendix B) */ - if (!after(end_seq, tp->frto_highmark)) - state->flag |= FLAG_ONLY_ORIG_SACKED; } if (sacked & TCPCB_LOST) { @@ -1555,7 +1547,6 @@ static int tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb, u32 prior_snd_una) { - const struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); const unsigned char *ptr = (skb_transport_header(ack_skb) + TCP_SKB_CB(ack_skb)->sacked); @@ -1728,12 +1719,6 @@ walk: start_seq, end_seq, dup_sack); advance_sp: - /* SACK enhanced FRTO (RFC4138, Appendix B): Clearing correct - * due to in-order walk - */ - if (after(end_seq, tp->frto_highmark)) - state.flag &= ~FLAG_ONLY_ORIG_SACKED; - i++; } @@ -1750,8 +1735,7 @@ advance_sp: tcp_verify_left_out(tp); if ((state.reord < tp->fackets_out) && - ((icsk->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker) && - (!tp->frto_highmark || after(tp->snd_una, tp->frto_highmark))) + ((inet_csk(sk)->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker)) tcp_update_reordering(sk, tp->fackets_out - state.reord, 0); out: @@ -1825,197 +1809,6 @@ static inline void tcp_reset_reno_sack(struct tcp_sock *tp) tp->sacked_out = 0; } -static int tcp_is_sackfrto(const struct tcp_sock *tp) -{ - return (sysctl_tcp_frto == 0x2) && !tcp_is_reno(tp); -} - -/* F-RTO can only be used if TCP has never retransmitted anything other than - * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here) - */ -bool tcp_use_frto(struct sock *sk) -{ - const struct tcp_sock *tp = tcp_sk(sk); - const struct inet_connection_sock *icsk = inet_csk(sk); - struct sk_buff *skb; - - if (!sysctl_tcp_frto) - return false; - - /* MTU probe and F-RTO won't really play nicely along currently */ - if (icsk->icsk_mtup.probe_size) - return false; - - if (tcp_is_sackfrto(tp)) - return true; - - /* Avoid expensive walking of rexmit queue if possible */ - if (tp->retrans_out > 1) - return false; - - skb = tcp_write_queue_head(sk); - if (tcp_skb_is_last(sk, skb)) - return true; - skb = tcp_write_queue_next(sk, skb); /* Skips head */ - tcp_for_write_queue_from(skb, sk) { - if (skb == tcp_send_head(sk)) - break; - if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS) - return false; - /* Short-circuit when first non-SACKed skb 
has been checked */ - if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) - break; - } - return true; -} - -/* RTO occurred, but do not yet enter Loss state. Instead, defer RTO - * recovery a bit and use heuristics in tcp_process_frto() to detect if - * the RTO was spurious. Only clear SACKED_RETRANS of the head here to - * keep retrans_out counting accurate (with SACK F-RTO, other than head - * may still have that bit set); TCPCB_LOST and remaining SACKED_RETRANS - * bits are handled if the Loss state is really to be entered (in - * tcp_enter_frto_loss). - * - * Do like tcp_enter_loss() would; when RTO expires the second time it - * does: - * "Reduce ssthresh if it has not yet been made inside this window." - */ -void tcp_enter_frto(struct sock *sk) -{ - const struct inet_connection_sock *icsk = inet_csk(sk); - struct tcp_sock *tp = tcp_sk(sk); - struct sk_buff *skb; - - if ((!tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) || - tp->snd_una == tp->high_seq || - ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) && - !icsk->icsk_retransmits)) { - tp->prior_ssthresh = tcp_current_ssthresh(sk); - /* Our state is too optimistic in ssthresh() call because cwnd - * is not reduced until tcp_enter_frto_loss() when previous F-RTO - * recovery has not yet completed. Pattern would be this: RTO, - * Cumulative ACK, RTO (2xRTO for the same segment does not end - * up here twice). - * RFC4138 should be more specific on what to do, even though - * RTO is quite unlikely to occur after the first Cumulative ACK - * due to back-off and complexity of triggering events ... - */ - if (tp->frto_counter) { - u32 stored_cwnd; - stored_cwnd = tp->snd_cwnd; - tp->snd_cwnd = 2; - tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); - tp->snd_cwnd = stored_cwnd; - } else { - tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); - } - /* ... in theory, cong.control module could do "any tricks" in - * ssthresh(), which means that ca_state, lost bits and lost_out - * counter would have to be faked before the call occurs. We - * consider that too expensive, unlikely and hacky, so modules - * using these in ssthresh() must deal these incompatibility - * issues if they receives CA_EVENT_FRTO and frto_counter != 0 - */ - tcp_ca_event(sk, CA_EVENT_FRTO); - } - - tp->undo_marker = tp->snd_una; - tp->undo_retrans = 0; - - skb = tcp_write_queue_head(sk); - if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS) - tp->undo_marker = 0; - if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; - tp->retrans_out -= tcp_skb_pcount(skb); - } - tcp_verify_left_out(tp); - - /* Too bad if TCP was application limited */ - tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1); - - /* Earlier loss recovery underway (see RFC4138; Appendix B). - * The last condition is necessary at least in tp->frto_counter case. - */ - if (tcp_is_sackfrto(tp) && (tp->frto_counter || - ((1 << icsk->icsk_ca_state) & (TCPF_CA_Recovery|TCPF_CA_Loss))) && - after(tp->high_seq, tp->snd_una)) { - tp->frto_highmark = tp->high_seq; - } else { - tp->frto_highmark = tp->snd_nxt; - } - tcp_set_ca_state(sk, TCP_CA_Disorder); - tp->high_seq = tp->snd_nxt; - tp->frto_counter = 1; -} - -/* Enter Loss state after F-RTO was applied. Dupack arrived after RTO, - * which indicates that we should follow the traditional RTO recovery, - * i.e. mark everything lost and do go-back-N retransmission. 
- */ -static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag) -{ - struct tcp_sock *tp = tcp_sk(sk); - struct sk_buff *skb; - - tp->lost_out = 0; - tp->retrans_out = 0; - if (tcp_is_reno(tp)) - tcp_reset_reno_sack(tp); - - tcp_for_write_queue(skb, sk) { - if (skb == tcp_send_head(sk)) - break; - - TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST; - /* - * Count the retransmission made on RTO correctly (only when - * waiting for the first ACK and did not get it)... - */ - if ((tp->frto_counter == 1) && !(flag & FLAG_DATA_ACKED)) { - /* For some reason this R-bit might get cleared? */ - if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) - tp->retrans_out += tcp_skb_pcount(skb); - /* ...enter this if branch just for the first segment */ - flag |= FLAG_DATA_ACKED; - } else { - if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS) - tp->undo_marker = 0; - TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; - } - - /* Marking forward transmissions that were made after RTO lost - * can cause unnecessary retransmissions in some scenarios, - * SACK blocks will mitigate that in some but not in all cases. - * We used to not mark them but it was causing break-ups with - * receivers that do only in-order receival. - * - * TODO: we could detect presence of such receiver and select - * different behavior per flow. - */ - if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) { - TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; - tp->lost_out += tcp_skb_pcount(skb); - tp->retransmit_high = TCP_SKB_CB(skb)->end_seq; - } - } - tcp_verify_left_out(tp); - - tp->snd_cwnd = tcp_packets_in_flight(tp) + allowed_segments; - tp->snd_cwnd_cnt = 0; - tp->snd_cwnd_stamp = tcp_time_stamp; - tp->frto_counter = 0; - - tp->reordering = min_t(unsigned int, tp->reordering, - sysctl_tcp_reordering); - tcp_set_ca_state(sk, TCP_CA_Loss); - tp->high_seq = tp->snd_nxt; - TCP_ECN_queue_cwr(tp); - - tcp_clear_all_retrans_hints(tp); -} - static void tcp_clear_retrans_partial(struct tcp_sock *tp) { tp->retrans_out = 0; @@ -2090,8 +1883,6 @@ void tcp_enter_loss(struct sock *sk, int how) tcp_set_ca_state(sk, TCP_CA_Loss); tp->high_seq = tp->snd_nxt; TCP_ECN_queue_cwr(tp); - /* Abort F-RTO algorithm if one is in progress */ - tp->frto_counter = 0; } /* If ACK arrived pointing to a remembered SACK, it means that our @@ -2275,10 +2066,6 @@ static bool tcp_time_to_recover(struct sock *sk, int flag) struct tcp_sock *tp = tcp_sk(sk); __u32 packets_out; - /* Do not perform any recovery during F-RTO algorithm */ - if (tp->frto_counter) - return false; - /* Trick#1: The loss is proven. */ if (tp->lost_out) return true; @@ -2760,7 +2547,7 @@ static void tcp_try_to_open(struct sock *sk, int flag, int newly_acked_sacked) tcp_verify_left_out(tp); - if (!tp->frto_counter && !tcp_any_retrans_done(sk)) + if (!tcp_any_retrans_done(sk)) tp->retrans_stamp = 0; if (flag & FLAG_ECE) @@ -3198,8 +2985,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets, flag |= FLAG_RETRANS_DATA_ACKED; ca_seq_rtt = -1; seq_rtt = -1; - if ((flag & FLAG_DATA_ACKED) || (acked_pcount > 1)) - flag |= FLAG_NONHEAD_RETRANS_ACKED; } else { ca_seq_rtt = now - scb->when; last_ackt = skb->tstamp; @@ -3408,150 +3193,6 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32 return flag; } -/* A very conservative spurious RTO response algorithm: reduce cwnd and - * continue in congestion avoidance. 
- */ -static void tcp_conservative_spur_to_response(struct tcp_sock *tp) -{ - tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); - tp->snd_cwnd_cnt = 0; - TCP_ECN_queue_cwr(tp); - tcp_moderate_cwnd(tp); -} - -/* A conservative spurious RTO response algorithm: reduce cwnd using - * PRR and continue in congestion avoidance. - */ -static void tcp_cwr_spur_to_response(struct sock *sk) -{ - tcp_enter_cwr(sk, 0); -} - -static void tcp_undo_spur_to_response(struct sock *sk, int flag) -{ - if (flag & FLAG_ECE) - tcp_cwr_spur_to_response(sk); - else - tcp_undo_cwr(sk, true); -} - -/* F-RTO spurious RTO detection algorithm (RFC4138) - * - * F-RTO affects during two new ACKs following RTO (well, almost, see inline - * comments). State (ACK number) is kept in frto_counter. When ACK advances - * window (but not to or beyond highest sequence sent before RTO): - * On First ACK, send two new segments out. - * On Second ACK, RTO was likely spurious. Do spurious response (response - * algorithm is not part of the F-RTO detection algorithm - * given in RFC4138 but can be selected separately). - * Otherwise (basically on duplicate ACK), RTO was (likely) caused by a loss - * and TCP falls back to conventional RTO recovery. F-RTO allows overriding - * of Nagle, this is done using frto_counter states 2 and 3, when a new data - * segment of any size sent during F-RTO, state 2 is upgraded to 3. - * - * Rationale: if the RTO was spurious, new ACKs should arrive from the - * original window even after we transmit two new data segments. - * - * SACK version: - * on first step, wait until first cumulative ACK arrives, then move to - * the second step. In second step, the next ACK decides. - * - * F-RTO is implemented (mainly) in four functions: - * - tcp_use_frto() is used to determine if TCP is can use F-RTO - * - tcp_enter_frto() prepares TCP state on RTO if F-RTO is used, it is - * called when tcp_use_frto() showed green light - * - tcp_process_frto() handles incoming ACKs during F-RTO algorithm - * - tcp_enter_frto_loss() is called if there is not enough evidence - * to prove that the RTO is indeed spurious. It transfers the control - * from F-RTO to the conventional RTO recovery - */ -static bool tcp_process_frto(struct sock *sk, int flag) -{ - struct tcp_sock *tp = tcp_sk(sk); - - tcp_verify_left_out(tp); - - /* Duplicate the behavior from Loss state (fastretrans_alert) */ - if (flag & FLAG_DATA_ACKED) - inet_csk(sk)->icsk_retransmits = 0; - - if ((flag & FLAG_NONHEAD_RETRANS_ACKED) || - ((tp->frto_counter >= 2) && (flag & FLAG_RETRANS_DATA_ACKED))) - tp->undo_marker = 0; - - if (!before(tp->snd_una, tp->frto_highmark)) { - tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 2 : 3), flag); - return true; - } - - if (!tcp_is_sackfrto(tp)) { - /* RFC4138 shortcoming in step 2; should also have case c): - * ACK isn't duplicate nor advances window, e.g., opposite dir - * data, winupdate - */ - if (!(flag & FLAG_ANY_PROGRESS) && (flag & FLAG_NOT_DUP)) - return true; - - if (!(flag & FLAG_DATA_ACKED)) { - tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3), - flag); - return true; - } - } else { - if (!(flag & FLAG_DATA_ACKED) && (tp->frto_counter == 1)) { - if (!tcp_packets_in_flight(tp)) { - tcp_enter_frto_loss(sk, 2, flag); - return true; - } - - /* Prevent sending of new data. 
*/ - tp->snd_cwnd = min(tp->snd_cwnd, - tcp_packets_in_flight(tp)); - return true; - } - - if ((tp->frto_counter >= 2) && - (!(flag & FLAG_FORWARD_PROGRESS) || - ((flag & FLAG_DATA_SACKED) && - !(flag & FLAG_ONLY_ORIG_SACKED)))) { - /* RFC4138 shortcoming (see comment above) */ - if (!(flag & FLAG_FORWARD_PROGRESS) && - (flag & FLAG_NOT_DUP)) - return true; - - tcp_enter_frto_loss(sk, 3, flag); - return true; - } - } - - if (tp->frto_counter == 1) { - /* tcp_may_send_now needs to see updated state */ - tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; - tp->frto_counter = 2; - - if (!tcp_may_send_now(sk)) - tcp_enter_frto_loss(sk, 2, flag); - - return true; - } else { - switch (sysctl_tcp_frto_response) { - case 2: - tcp_undo_spur_to_response(sk, flag); - break; - case 1: - tcp_conservative_spur_to_response(tp); - break; - default: - tcp_cwr_spur_to_response(sk); - break; - } - tp->frto_counter = 0; - tp->undo_marker = 0; - NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSPURIOUSRTOS); - } - return false; -} - /* RFC 5961 7 [ACK Throttling] */ static void tcp_send_challenge_ack(struct sock *sk) { @@ -3616,7 +3257,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) int prior_packets; int prior_sacked = tp->sacked_out; int pkts_acked = 0; - bool frto_cwnd = false; /* If the ack is older than previous acks * then we can probably ignore it. @@ -3690,22 +3330,15 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) pkts_acked = prior_packets - tp->packets_out; - if (tp->frto_counter) - frto_cwnd = tcp_process_frto(sk, flag); - /* Guarantee sacktag reordering detection against wrap-arounds */ - if (before(tp->frto_highmark, tp->snd_una)) - tp->frto_highmark = 0; - if (tcp_ack_is_dubious(sk, flag)) { /* Advance CWND, if state allows this. 
*/ - if ((flag & FLAG_DATA_ACKED) && !frto_cwnd && - tcp_may_raise_cwnd(sk, flag)) + if ((flag & FLAG_DATA_ACKED) && tcp_may_raise_cwnd(sk, flag)) tcp_cong_avoid(sk, ack, prior_in_flight); is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP)); tcp_fastretrans_alert(sk, pkts_acked, prior_sacked, is_dupack, flag); } else { - if ((flag & FLAG_DATA_ACKED) && !frto_cwnd) + if (flag & FLAG_DATA_ACKED) tcp_cong_avoid(sk, ack, prior_in_flight); } diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 8f0234f8bb95..05eaf8904613 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -422,9 +422,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, newtp->snd_cwnd = TCP_INIT_CWND; newtp->snd_cwnd_cnt = 0; - newtp->frto_counter = 0; - newtp->frto_highmark = 0; - if (newicsk->icsk_ca_ops != &tcp_init_congestion_ops && !try_module_get(newicsk->icsk_ca_ops->owner)) newicsk->icsk_ca_ops = &tcp_init_congestion_ops; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e787ecec505e..163cf5fc0119 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -78,10 +78,6 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb) tcp_advance_send_head(sk, skb); tp->snd_nxt = TCP_SKB_CB(skb)->end_seq; - /* Don't override Nagle indefinitely with F-RTO */ - if (tp->frto_counter == 2) - tp->frto_counter = 3; - tp->packets_out += tcp_skb_pcount(skb); if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) @@ -1470,11 +1466,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf if (nonagle & TCP_NAGLE_PUSH) return true; - /* Don't use the nagle rule for urgent data (or for the final FIN). - * Nagle can be ignored during F-RTO too (see RFC4138). - */ - if (tcp_urg_mode(tp) || (tp->frto_counter == 2) || - (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) + /* Don't use the nagle rule for urgent data (or for the final FIN). */ + if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) return true; if (!tcp_nagle_check(tp, skb, cur_mss, nonagle)) diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index eeccf795e917..4b85e6f636c9 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -416,11 +416,7 @@ void tcp_retransmit_timer(struct sock *sk) NET_INC_STATS_BH(sock_net(sk), mib_idx); } - if (tcp_use_frto(sk)) { - tcp_enter_frto(sk); - } else { - tcp_enter_loss(sk, 0); - } + tcp_enter_loss(sk, 0); if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) { /* Retransmission failed because of local congestion, diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c index 1b91bf48e277..76a1e23259e1 100644 --- a/net/ipv4/tcp_westwood.c +++ b/net/ipv4/tcp_westwood.c @@ -236,7 +236,7 @@ static void tcp_westwood_event(struct sock *sk, enum tcp_ca_event event) tp->snd_cwnd = tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk); break; - case CA_EVENT_FRTO: + case CA_EVENT_LOSS: tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk); /* Update RTT_min when next ack arrives */ w->reset_rtt_min = 1; -- cgit v1.2.3 From e33099f96d99c391b3325caa9c44258de04aae86 Mon Sep 17 00:00:00 2001 From: Yuchung Cheng Date: Wed, 20 Mar 2013 13:33:00 +0000 Subject: tcp: implement RFC5682 F-RTO This patch implements F-RTO (forward RTO recovery): When the first retransmission after a timeout is acknowledged, F-RTO sends new data instead of old data.
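For illustration only, the decision rule this commit message goes on to describe can be condensed into a short C sketch. All names below are invented for clarity; this is a simplified model, not the kernel's actual control flow:

enum frto_verdict { FRTO_SPURIOUS, FRTO_GENUINE, FRTO_UNDECIDED };

struct ack_info {
        int orig_sack_acked;    /* ACK (s)acks never-retransmitted data */
        int new_data_sacked;    /* ACK sacks data sent after the timeout */
        int snd_una_advanced;   /* cumulative ACK made forward progress */
};

/* Classify one ACK arriving while F-RTO is active. */
static enum frto_verdict frto_on_ack(const struct ack_info *a)
{
        if (a->orig_sack_acked)
                return FRTO_SPURIOUS;   /* undo cwnd/ssthresh reduction */
        if (a->new_data_sacked)
                return FRTO_GENUINE;    /* continue loss recovery */
        if (a->snd_una_advanced)
                return FRTO_UNDECIDED;  /* send new data; next ACK decides */
        return FRTO_GENUINE;            /* pure dupack: treat loss as real */
}

In the patch itself this logic is folded into tcp_process_loss() rather than a separate helper.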
If the next ACK acknowledges some never-retransmitted data, then the timeout was spurious and the congestion state is reverted. Otherwise, if the next ACK selectively acknowledges the new data, then the timeout was genuine and the loss recovery continues. This idea applies to recurring timeouts as well. While F-RTO sends different data during timeout recovery, it does not (and should not) change the congestion control. The implementation follows the three steps of the SACK-enhanced algorithm (section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Steps 2 and 3 are in tcp_process_loss(). The basic version is not supported because the SACK-enhanced version also works for non-SACK connections. The new implementation is functionally in parity with the old F-RTO implementation except the one case where it increases undo events: In addition to the RFC algorithm, a spurious timeout may be detected without sending data in step 2, as long as the SACK confirms not all the original data are dropped. When this happens, the sender will undo the cwnd and perhaps enter fast recovery instead. This additional check increases the F-RTO undo events by 5x compared to the prior implementation on Google Web servers, since the sender often does not have new data to send for HTTP. Note that F-RTO may detect a spurious timeout before Eifel with timestamps does so. Signed-off-by: Yuchung Cheng Acked-by: Eric Dumazet Acked-by: Neal Cardwell Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 18 +++------ include/linux/tcp.h | 3 +- net/ipv4/tcp_input.c | 73 ++++++++++++++++++++++++++++------ 3 files changed, 68 insertions(+), 26 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 8a977a0aaede..f98ca633b528 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -225,19 +225,13 @@ tcp_fin_timeout - INTEGER Default: 60 seconds tcp_frto - INTEGER - Enables Forward RTO-Recovery (F-RTO) defined in RFC4138. + Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. F-RTO is an enhanced recovery algorithm for TCP retransmission - timeouts. It is particularly beneficial in wireless environments - where packet loss is typically due to random radio interference - rather than intermediate router congestion. F-RTO is sender-side - only modification. Therefore it does not require any support from - the peer. - - If set to 1, basic version is enabled. 2 enables SACK enhanced - F-RTO if flow uses SACK. The basic version can be used also when - SACK is in use though scenario(s) with it exists where F-RTO - interacts badly with the packet counting of the SACK enabled TCP - flow. + timeouts. It is particularly beneficial in networks where the + RTT fluctuates (e.g., wireless). F-RTO is sender-side only + modification. It does not require any support from the peer. + + By default it's enabled with a non-zero value. 0 disables F-RTO. tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. diff --git a/include/linux/tcp.h b/include/linux/tcp.h index f5f203b36379..5adbc33d1ab3 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -192,7 +192,8 @@ struct tcp_sock { u8 nonagle : 4,/* Disable Nagle algorithm?
*/ thin_lto : 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ - repair : 1; + repair : 1, + frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */ u8 repair_queue; u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */ syn_data:1, /* SYN includes data */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8d821e45b917..b2b36196b342 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -107,6 +107,7 @@ int sysctl_tcp_early_retrans __read_mostly = 3; #define FLAG_DATA_SACKED 0x20 /* New SACK. */ #define FLAG_ECE 0x40 /* ECE in this ACK */ #define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/ +#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */ #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */ #define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */ #define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */ @@ -1155,6 +1156,8 @@ static u8 tcp_sacktag_one(struct sock *sk, tcp_highest_sack_seq(tp))) state->reord = min(fack_count, state->reord); + if (!after(end_seq, tp->high_seq)) + state->flag |= FLAG_ORIG_SACK_ACKED; } if (sacked & TCPCB_LOST) { @@ -1835,10 +1838,13 @@ void tcp_enter_loss(struct sock *sk, int how) const struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; + bool new_recovery = false; /* Reduce ssthresh if it has not yet been made inside this window. */ - if (icsk->icsk_ca_state <= TCP_CA_Disorder || tp->snd_una == tp->high_seq || + if (icsk->icsk_ca_state <= TCP_CA_Disorder || + !after(tp->high_seq, tp->snd_una) || (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { + new_recovery = true; tp->prior_ssthresh = tcp_current_ssthresh(sk); tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); tcp_ca_event(sk, CA_EVENT_LOSS); @@ -1883,6 +1889,14 @@ void tcp_enter_loss(struct sock *sk, int how) tcp_set_ca_state(sk, TCP_CA_Loss); tp->high_seq = tp->snd_nxt; TCP_ECN_queue_cwr(tp); + + /* F-RTO RFC5682 sec 3.1 step 1: retransmit SND.UNA if no previous + * loss recovery is underway except recurring timeout(s) on + * the same SND.UNA (sec 3.2). Disable F-RTO on path MTU probing + */ + tp->frto = sysctl_tcp_frto && + (new_recovery || icsk->icsk_retransmits) && + !inet_csk(sk)->icsk_mtup.probe_size; } /* If ACK arrived pointing to a remembered SACK, it means that our @@ -2426,12 +2440,12 @@ static int tcp_try_undo_partial(struct sock *sk, int acked) return failed; } -/* Undo during loss recovery after partial ACK. */ -static bool tcp_try_undo_loss(struct sock *sk) +/* Undo during loss recovery after partial ACK or using F-RTO. */ +static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo) { struct tcp_sock *tp = tcp_sk(sk); - if (tcp_may_undo(tp)) { + if (frto_undo || tcp_may_undo(tp)) { struct sk_buff *skb; tcp_for_write_queue(skb, sk) { if (skb == tcp_send_head(sk)) @@ -2445,9 +2459,12 @@ static bool tcp_try_undo_loss(struct sock *sk) tp->lost_out = 0; tcp_undo_cwr(sk, true); NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO); + if (frto_undo) + NET_INC_STATS_BH(sock_net(sk), + LINUX_MIB_TCPSPURIOUSRTOS); inet_csk(sk)->icsk_retransmits = 0; tp->undo_marker = 0; - if (tcp_is_sack(tp)) + if (frto_undo || tcp_is_sack(tp)) tcp_set_ca_state(sk, TCP_CA_Open); return true; } @@ -2667,24 +2684,52 @@ static void tcp_enter_recovery(struct sock *sk, bool ece_ack) /* Process an ACK in CA_Loss state. 
Move to CA_Open if lost data are * recovered or spurious. Otherwise retransmits more on partial ACKs. */ -static void tcp_process_loss(struct sock *sk, int flag) +static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); + bool recovered = !before(tp->snd_una, tp->high_seq); - if (!before(tp->snd_una, tp->high_seq)) { + if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */ + if (flag & FLAG_ORIG_SACK_ACKED) { + /* Step 3.b. A timeout is spurious if not all data are + * lost, i.e., never-retransmitted data are (s)acked. + */ + tcp_try_undo_loss(sk, true); + return; + } + if (after(tp->snd_nxt, tp->high_seq) + (flag & FLAG_DATA_SACKED || is_dupack)) { + tp->frto = 0; /* Loss was real: 2nd part of step 3.a */ + } else if (flag & FLAG_SND_UNA_ADVANCED && !recovered) { + tp->high_seq = tp->snd_nxt; + __tcp_push_pending_frames(sk, tcp_current_mss(sk), + TCP_NAGLE_OFF); + if (after(tp->snd_nxt, tp->high_seq)) + return; /* Step 2.b */ + tp->frto = 0; + } + } + + if (recovered) { + /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */ icsk->icsk_retransmits = 0; tcp_try_undo_recovery(sk); return; } - if (flag & FLAG_DATA_ACKED) icsk->icsk_retransmits = 0; - if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED) - tcp_reset_reno_sack(tp); - if (tcp_try_undo_loss(sk)) + if (tcp_is_reno(tp)) { + /* A Reno DUPACK means new data in F-RTO step 2.b above are + * delivered. Lower inflight to clock out (re)transmissions. + */ + if (after(tp->snd_nxt, tp->high_seq) && is_dupack) + tcp_add_reno_sack(sk); + else if (flag & FLAG_SND_UNA_ADVANCED) + tcp_reset_reno_sack(tp); + } + if (tcp_try_undo_loss(sk, false)) return; - tcp_moderate_cwnd(tp); tcp_xmit_retransmit_queue(sk); } @@ -2764,7 +2809,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, newly_acked_sacked = pkts_acked + tp->sacked_out - prior_sacked; break; case TCP_CA_Loss: - tcp_process_loss(sk, flag); + tcp_process_loss(sk, flag, is_dupack); if (icsk->icsk_ca_state != TCP_CA_Open) return; /* Fall through to processing in Open state. */ @@ -3003,6 +3048,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets, } if (!(sacked & TCPCB_SACKED_ACKED)) reord = min(pkts_acked, reord); + if (!after(scb->end_seq, tp->high_seq)) + flag |= FLAG_ORIG_SACK_ACKED; } if (sacked & TCPCB_SACKED_ACKED) -- cgit v1.2.3
From 2ec985213864cb64c45dc0284d7316142eefb5d4 Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Fri, 22 Mar 2013 03:39:27 +0000 Subject: net: mvmdio: enhance driver to support SMI error/done interrupts This patch enhances the "mvmdio" driver to support an SMI error/done interrupt line which can be used along with a wait queue instead of doing busy-waiting on the registers. This is a feature which is available in the mv643xx_eth SMI code and thus further reduces the gap between the two. Signed-off-by: Florian Fainelli Signed-off-by: David S.
Miller --- .../devicetree/bindings/net/marvell-orion-mdio.txt | 3 + drivers/net/ethernet/marvell/mvmdio.c | 98 ++++++++++++++++++---- 2 files changed, 83 insertions(+), 18 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt index 34e7aafa321c..052b5f28a624 100644 --- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt +++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt @@ -9,6 +9,9 @@ Required properties: - compatible: "marvell,orion-mdio" - reg: address and length of the SMI register +Optional properties: +- interrupts: interrupt line number for the SMI error/done interrupt + The child nodes of the MDIO driver are the individual PHY devices connected to this MDIO bus. They must have a "reg" property given the PHY address on the MDIO bus. diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c index 3e2711d22451..3472574602b2 100644 --- a/drivers/net/ethernet/marvell/mvmdio.c +++ b/drivers/net/ethernet/marvell/mvmdio.c @@ -24,10 +24,13 @@ #include #include #include +#include #include #include #include #include +#include +#include #define MVMDIO_SMI_DATA_SHIFT 0 #define MVMDIO_SMI_PHY_ADDR_SHIFT 16 @@ -36,33 +39,58 @@ #define MVMDIO_SMI_WRITE_OPERATION 0 #define MVMDIO_SMI_READ_VALID BIT(27) #define MVMDIO_SMI_BUSY BIT(28) +#define MVMDIO_ERR_INT_CAUSE 0x007C +#define MVMDIO_ERR_INT_SMI_DONE 0x00000010 +#define MVMDIO_ERR_INT_MASK 0x0080 struct orion_mdio_dev { struct mutex lock; void __iomem *regs; + /* + * If we have access to the error interrupt pin (which is + * somewhat misnamed as it not only reflects internal errors + * but also reflects SMI completion), use that to wait for + * SMI access completion instead of polling the SMI busy bit. 
+ */ + int err_interrupt; + wait_queue_head_t smi_busy_wait; }; +static int orion_mdio_smi_is_done(struct orion_mdio_dev *dev) +{ + return !(readl(dev->regs) & MVMDIO_SMI_BUSY); +} + /* Wait for the SMI unit to be ready for another operation */ static int orion_mdio_wait_ready(struct mii_bus *bus) { struct orion_mdio_dev *dev = bus->priv; int count; - u32 val; - count = 0; - while (1) { - val = readl(dev->regs); - if (!(val & MVMDIO_SMI_BUSY)) - break; + if (dev->err_interrupt <= 0) { + count = 0; + while (1) { + if (orion_mdio_smi_is_done(dev)) + break; - if (count > 100) { - dev_err(bus->parent, "Timeout: SMI busy for too long\n"); - return -ETIMEDOUT; - } + if (count > 100) { + dev_err(bus->parent, + "Timeout: SMI busy for too long\n"); + return -ETIMEDOUT; + } - udelay(10); - count++; + udelay(10); + count++; + } + } else { + if (!orion_mdio_smi_is_done(dev)) { + wait_event_timeout(dev->smi_busy_wait, + orion_mdio_smi_is_done(dev), + msecs_to_jiffies(100)); + if (!orion_mdio_smi_is_done(dev)) + return -ETIMEDOUT; + } } return 0; @@ -141,6 +169,21 @@ static int orion_mdio_reset(struct mii_bus *bus) return 0; } +static irqreturn_t orion_mdio_err_irq(int irq, void *dev_id) +{ + struct orion_mdio_dev *dev = dev_id; + + if (readl(dev->regs + MVMDIO_ERR_INT_CAUSE) & + MVMDIO_ERR_INT_SMI_DONE) { + writel(~MVMDIO_ERR_INT_SMI_DONE, + dev->regs + MVMDIO_ERR_INT_CAUSE); + wake_up(&dev->smi_busy_wait); + return IRQ_HANDLED; + } + + return IRQ_NONE; +} + static int orion_mdio_probe(struct platform_device *pdev) { struct resource *r; @@ -181,9 +224,22 @@ static int orion_mdio_probe(struct platform_device *pdev) dev->regs = devm_ioremap(&pdev->dev, r->start, resource_size(r)); if (!dev->regs) { dev_err(&pdev->dev, "Unable to remap SMI register\n"); - kfree(bus->irq); - mdiobus_free(bus); - return -ENODEV; + ret = -ENODEV; + goto out_mdio; + } + + init_waitqueue_head(&dev->smi_busy_wait); + + dev->err_interrupt = platform_get_irq(pdev, 0); + if (dev->err_interrupt != -ENXIO) { + ret = devm_request_irq(&pdev->dev, dev->err_interrupt, + orion_mdio_err_irq, + IRQF_SHARED, pdev->name, dev); + if (ret) + goto out_mdio; + + writel(MVMDIO_ERR_INT_SMI_DONE, + dev->regs + MVMDIO_ERR_INT_MASK); } mutex_init(&dev->lock); @@ -194,19 +250,25 @@ static int orion_mdio_probe(struct platform_device *pdev) ret = mdiobus_register(bus); if (ret < 0) { dev_err(&pdev->dev, "Cannot register MDIO bus (%d)\n", ret); - kfree(bus->irq); - mdiobus_free(bus); - return ret; + goto out_mdio; } platform_set_drvdata(pdev, bus); return 0; + +out_mdio: + kfree(bus->irq); + mdiobus_free(bus); + return ret; } static int orion_mdio_remove(struct platform_device *pdev) { struct mii_bus *bus = platform_get_drvdata(pdev); + struct orion_mdio_dev *dev = bus->priv; + + writel(0, dev->regs + MVMDIO_ERR_INT_MASK); mdiobus_unregister(bus); kfree(bus->irq); mdiobus_free(bus); -- cgit v1.2.3 From 5e95329b701c4edf6c4d72487ec0369fa148c0bd Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Fri, 22 Mar 2013 10:50:50 +0000 Subject: dsa: add device tree bindings to register DSA switches This patch adds support for registering DSA switches using Device Tree bindings. Note that we support programming the switch routing table even though no in-tree user seems to require it. I tested this on Armada 370 with a Marvell 88E6172 (not supported by mainline yet). Signed-off-by: Florian Fainelli Signed-off-by: David S. 
Miller --- Documentation/devicetree/bindings/net/dsa/dsa.txt | 91 +++++++++ net/dsa/dsa.c | 233 +++++++++++++++++++++- 2 files changed, 319 insertions(+), 5 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/dsa/dsa.txt (limited to 'Documentation')
diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt new file mode 100644 index 000000000000..db92f55ef838 --- /dev/null +++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt @@ -0,0 +1,91 @@ +Marvell Distributed Switch Architecture Device Tree Bindings +------------------------------------------------------------ + +Required properties: +- compatible : Should be "marvell,dsa" +- #address-cells : Must be 2, first cell is the address on the MDIO bus + and second cell is the address in the switch tree. + Second cell is used only when cascading/chaining. +- #size-cells : Must be 0 +- dsa,ethernet : Should be a phandle to a valid Ethernet device node +- dsa,mii-bus : Should be a phandle to a valid MDIO bus device node + +Optional properties: +- interrupts : property with a value describing the switch + interrupt number (not supported by the driver) + +A DSA node can contain multiple switch chips which are therefore child nodes of +the parent DSA node. The maximum number of allowed child nodes is 4 +(DSA_MAX_SWITCHES).
+Each of these switch child nodes should have the following required properties: + +- reg : Describes the switch address on the MII bus +- #address-cells : Must be 1 +- #size-cells : Must be 0 + +A switch may have multiple "port" child nodes. + +Each port child node must have the following mandatory properties: +- reg : Describes the port address in the switch +- label : Describes the label associated with this port, special + labels are "cpu" to indicate a CPU port and "dsa" to + indicate an uplink/downlink port. + +Note that a port labelled "dsa" will imply checking for the uplink phandle +described below. + +Optional property: +- link : Should be a phandle to another switch's DSA port. + This property is only used when switches are being + chained/cascaded together.
+ +Example: + + dsa@0 { + compatible = "marvell,dsa"; + #address-cells = <1>; + #size-cells = <0>; + + interrupts = <10>; + dsa,ethernet = <&ethernet0>; + dsa,mii-bus = <&mii_bus0>; + + switch@0 { + #address-cells = <1>; + #size-cells = <0>; + reg = <16 0>; /* MDIO address 16, switch 0 in tree */ + + port@0 { + reg = <0>; + label = "lan1"; + }; + + port@1 { + reg = <1>; + label = "lan2"; + }; + + port@5 { + reg = <5>; + label = "cpu"; + }; + + switch0uplink: port@6 { + reg = <6>; + label = "dsa"; + link = <&switch1uplink>; + }; + }; + + switch@1 { + #address-cells = <1>; + #size-cells = <0>; + reg = <17 1>; /* MDIO address 17, switch 1 in tree */ + + switch1uplink: port@0 { + reg = <0>; + label = "dsa"; + link = <&switch0uplink>; + }; + }; + };
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 2bc62ea857c8..908bc11082db 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -1,6 +1,7 @@ /* * net/dsa/dsa.c - Hardware switch handling * Copyright (c) 2008-2009 Marvell Semiconductor + * Copyright (c) 2013 Florian Fainelli * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -14,6 +15,9 @@ #include #include #include +#include +#include +#include #include "dsa_priv.h" char dsa_driver_version[] = "0.1"; @@ -287,34 +291,239 @@ static struct net_device *dev_to_net_device(struct device *dev) return NULL; } +#ifdef CONFIG_OF +static int dsa_of_setup_routing_table(struct dsa_platform_data *pd, + struct dsa_chip_data *cd, + int chip_index, + struct device_node *link) +{ + int ret; + const __be32 *reg; + int link_port_addr; + int link_sw_addr; + struct device_node *parent_sw; + int len; + + parent_sw = of_get_parent(link); + if (!parent_sw) + return -EINVAL; + + reg = of_get_property(parent_sw, "reg", &len); + if (!reg || (len != sizeof(*reg) * 2)) + return -EINVAL; + + link_sw_addr = be32_to_cpup(reg + 1); + + if (link_sw_addr >= pd->nr_chips) + return -EINVAL; + + /* First time routing table allocation */ + if (!cd->rtable) { + cd->rtable = kmalloc(pd->nr_chips * sizeof(s8), GFP_KERNEL); + if (!cd->rtable) + return -ENOMEM; + + /* default to no valid uplink/downlink */ + memset(cd->rtable, -1, pd->nr_chips * sizeof(s8)); + } + + reg = of_get_property(link, "reg", NULL); + if (!reg) { + ret = -EINVAL; + goto out; + } + + link_port_addr = be32_to_cpup(reg); + + cd->rtable[link_sw_addr] = link_port_addr; + + return 0; +out: + kfree(cd->rtable); + return ret; +} + +static int dsa_of_probe(struct platform_device *pdev) +{ + struct device_node *np = pdev->dev.of_node; + struct device_node *child, *mdio, *ethernet, *port, *link; + struct mii_bus *mdio_bus; + struct platform_device *ethernet_dev; + struct dsa_platform_data *pd; + struct dsa_chip_data *cd; + const char *port_name; + int chip_index, port_index; + const unsigned int *sw_addr, *port_reg; + int ret, i; + + mdio = of_parse_phandle(np, "dsa,mii-bus", 0); + if (!mdio) + return -EINVAL; + + mdio_bus = of_mdio_find_bus(mdio); + if (!mdio_bus) + return -EINVAL; + + ethernet = of_parse_phandle(np, "dsa,ethernet", 0); + if (!ethernet) + return -EINVAL; + + ethernet_dev = of_find_device_by_node(ethernet); + if (!ethernet_dev) + return -ENODEV; + + pd = kzalloc(sizeof(*pd), GFP_KERNEL); + if (!pd) + return -ENOMEM; + + pdev->dev.platform_data = pd; + pd->netdev = &ethernet_dev->dev; + pd->nr_chips = of_get_child_count(np); + if (pd->nr_chips > DSA_MAX_SWITCHES) + pd->nr_chips = DSA_MAX_SWITCHES; + + pd->chip = kzalloc(pd->nr_chips * sizeof(struct dsa_chip_data), + GFP_KERNEL);
+ if (!pd->chip) { + ret = -ENOMEM; + goto out_free; + } + + chip_index = 0; + for_each_available_child_of_node(np, child) { + cd = &pd->chip[chip_index]; + + cd->mii_bus = &mdio_bus->dev; + + sw_addr = of_get_property(child, "reg", NULL); + if (!sw_addr) + continue; + + cd->sw_addr = be32_to_cpup(sw_addr); + if (cd->sw_addr > PHY_MAX_ADDR) + continue; + + for_each_available_child_of_node(child, port) { + port_reg = of_get_property(port, "reg", NULL); + if (!port_reg) + continue; + + port_index = be32_to_cpup(port_reg); + + port_name = of_get_property(port, "label", NULL); + if (!port_name) + continue; + + cd->port_names[port_index] = kstrdup(port_name, + GFP_KERNEL); + if (!cd->port_names[port_index]) { + ret = -ENOMEM; + goto out_free_chip; + } + + link = of_parse_phandle(port, "link", 0); + + if (!strcmp(port_name, "dsa") && link && + pd->nr_chips > 1) { + ret = dsa_of_setup_routing_table(pd, cd, + chip_index, link); + if (ret) + goto out_free_chip; + } + + if (port_index == DSA_MAX_PORTS) + break; + } + } + + return 0; + +out_free_chip: + for (i = 0; i < pd->nr_chips; i++) { + port_index = 0; + while (pd->chip[i].port_names && + pd->chip[i].port_names[++port_index]) + kfree(pd->chip[i].port_names[port_index]); + kfree(pd->chip[i].rtable); + } + kfree(pd->chip); +out_free: + kfree(pd); + pdev->dev.platform_data = NULL; + return ret; +} + +static void dsa_of_remove(struct platform_device *pdev) +{ + struct dsa_platform_data *pd = pdev->dev.platform_data; + int i; + int port_index; + + if (!pdev->dev.of_node) + return; + + for (i = 0; i < pd->nr_chips; i++) { + port_index = 0; + while (pd->chip[i].port_names && + pd->chip[i].port_names[++port_index]) + kfree(pd->chip[i].port_names[port_index]); + kfree(pd->chip[i].rtable); + } + + kfree(pd->chip); + kfree(pd); +} +#else +static inline int dsa_of_probe(struct platform_device *pdev) +{ + return 0; +} + +static inline void dsa_of_remove(struct platform_device *pdev) +{ +} +#endif + static int dsa_probe(struct platform_device *pdev) { static int dsa_version_printed; struct dsa_platform_data *pd = pdev->dev.platform_data; struct net_device *dev; struct dsa_switch_tree *dst; - int i; + int i, ret; if (!dsa_version_printed++) printk(KERN_NOTICE "Distributed Switch Architecture " "driver version %s\n", dsa_driver_version); + if (pdev->dev.of_node) { + ret = dsa_of_probe(pdev); + if (ret) + return ret; + + pd = pdev->dev.platform_data; + } + if (pd == NULL || pd->netdev == NULL) return -EINVAL; dev = dev_to_net_device(pd->netdev); - if (dev == NULL) - return -EINVAL; + if (dev == NULL) { + ret = -EINVAL; + goto out; + } if (dev->dsa_ptr != NULL) { dev_put(dev); - return -EEXIST; + ret = -EEXIST; + goto out; } dst = kzalloc(sizeof(*dst), GFP_KERNEL); if (dst == NULL) { dev_put(dev); - return -ENOMEM; + ret = -ENOMEM; + goto out; } platform_set_drvdata(pdev, dst); @@ -366,6 +575,11 @@ static int dsa_probe(struct platform_device *pdev) } return 0; + +out: + dsa_of_remove(pdev); + + return ret; } static int dsa_remove(struct platform_device *pdev) @@ -385,6 +599,8 @@ static int dsa_remove(struct platform_device *pdev) dsa_switch_destroy(ds); } + dsa_of_remove(pdev); + return 0; } @@ -392,6 +608,12 @@ static void dsa_shutdown(struct platform_device *pdev) { } +static const struct of_device_id dsa_of_match_table[] = { + { .compatible = "marvell,dsa", }, + {} +}; +MODULE_DEVICE_TABLE(of, dsa_of_match_table); + static struct platform_driver dsa_driver = { .probe = dsa_probe, .remove = dsa_remove, @@ -399,6 +621,7 @@ static struct platform_driver 
dsa_driver = { .driver = { .name = "dsa", .owner = THIS_MODULE, + .of_match_table = dsa_of_match_table, }, }; -- cgit v1.2.3
From 4c64f1f70cf3e7860bae12d62a31c137a6a4f4a7 Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Mon, 25 Mar 2013 05:03:38 +0000 Subject: dsa: fix device tree binding documentation typo on #address-cells The device tree binding documentation for dsa explicitly states that a DSA node should have its #address-cells property set to 2, yet the example still used 1; fix that typo. Signed-off-by: Florian Fainelli Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/dsa/dsa.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation')
diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt index db92f55ef838..49f4f7ae3f51 100644 --- a/Documentation/devicetree/bindings/net/dsa/dsa.txt +++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt @@ -43,7 +43,7 @@ Example: dsa@0 { compatible = "marvell,dsa"; - #address-cells = <1>; + #address-cells = <2>; #size-cells = <0>; interrupts = <10>; -- cgit v1.2.3
From 94fbbbf89492e460979cd10c6384a78a9dbf17ed Mon Sep 17 00:00:00 2001 From: Giuseppe CAVALLARO Date: Tue, 26 Mar 2013 04:43:12 +0000 Subject: stmmac: update the Doc and Version (PTP+SGMII) This patch updates the stmmac.txt file adding information related to the PTP and SGMII/RGMII support. Also the patch updates the driver version to: March_2013. Signed-off-by: Giuseppe Cavallaro Signed-off-by: David S. Miller --- Documentation/networking/stmmac.txt | 33 ++++++++++++++++++++++++++-- drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +- 2 files changed, 32 insertions(+), 3 deletions(-) (limited to 'Documentation')
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index f9fa6db40a52..8efe0b3c8b83 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -326,6 +326,35 @@ To enter in Tx LPI mode the driver needs to have a software timer that enable and disable the LPI mode when there is nothing to be transmitted. -7) TODO: +7) Extended descriptors +The extended descriptors give us information about the receive Ethernet payload +when it is carrying PTP packets or TCP/UDP/ICMP over IP. +These are not available on GMAC Synopsys chips older than the 3.50. +At probe time the driver will decide if these can actually be used. +This support is also mandatory for PTPv2 because the extra descriptors 6 and 7 +are used for saving the hardware timestamps. + +8) Precision Time Protocol (PTP) +The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), +which enables precise synchronization of clocks in measurement and +control systems implemented with technologies such as network +communication. + +In addition to the basic timestamp features mentioned in IEEE 1588-2002, +new GMAC cores support the advanced timestamp features of IEEE 1588-2008, +which can be enabled when configuring the kernel. + +9) SGMII/RGMII support +New GMAC devices provide their own way to manage RGMII/SGMII. +This information is available at run-time by looking at the +HW capability register. This means that the stmmac can manage +auto-negotiation and link status without using PHYLIB. +In fact, the HW provides a subset of extended registers to +restart the ANE, verify Full/Half duplex mode and Speed. +Also thanks to these registers it is possible to look at the +Auto-negotiated Link Partner Ability.
+ +10) TODO: o XGMAC is not supported. - o Add the PTP - precision time protocol + o Complete the TBI & RTBI support. + o extended VLAN support for 3.70a SYNP GMAC.
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h index 52002e7c59ae..75f997b467aa 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h @@ -24,7 +24,7 @@ #define __STMMAC_H__ #define STMMAC_RESOURCE_NAME "stmmaceth" -#define DRV_MODULE_VERSION "Nov_2012" +#define DRV_MODULE_VERSION "March_2013" #include #include -- cgit v1.2.3
From 3078cde792340280b761a0f46f99799a78f4395d Mon Sep 17 00:00:00 2001 From: Ludovic Desroches Date: Mon, 11 Mar 2013 18:26:03 +0100 Subject: can: at91_can: add dt support Add device tree support. Signed-off-by: Ludovic Desroches Signed-off-by: Marc Kleine-Budde --- .../devicetree/bindings/net/can/atmel-can.txt | 14 ++++ drivers/net/can/at91_can.c | 76 ++++++++++++++++------ 2 files changed, 71 insertions(+), 19 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/can/atmel-can.txt (limited to 'Documentation')
diff --git a/Documentation/devicetree/bindings/net/can/atmel-can.txt b/Documentation/devicetree/bindings/net/can/atmel-can.txt new file mode 100644 index 000000000000..72cf0c5daff4 --- /dev/null +++ b/Documentation/devicetree/bindings/net/can/atmel-can.txt @@ -0,0 +1,14 @@ +* AT91 CAN * + +Required properties: + - compatible: Should be "atmel,at91sam9263-can" or "atmel,at91sam9x5-can" + - reg: Should contain CAN controller registers location and length + - interrupts: Should contain IRQ line for the CAN controller + +Example: + + can0: can@f000c000 { + compatible = "atmel,at91sam9x5-can"; + reg = <0xf000c000 0x300>; + interrupts = <40 4 5>; + };
diff --git a/drivers/net/can/at91_can.c b/drivers/net/can/at91_can.c index 44f363792b59..db52f4414def 100644 --- a/drivers/net/can/at91_can.c +++ b/drivers/net/can/at91_can.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -155,19 +156,20 @@ struct at91_priv { canid_t mb0_id; }; -static const struct at91_devtype_data at91_devtype_data[] = { - [AT91_DEVTYPE_SAM9263] = { - .rx_first = 1, - .rx_split = 8, - .rx_last = 11, - .tx_shift = 2, - }, - [AT91_DEVTYPE_SAM9X5] = { - .rx_first = 0, - .rx_split = 4, - .rx_last = 5, - .tx_shift = 1, - }, +static const struct at91_devtype_data at91_at91sam9263_data = { + .rx_first = 1, + .rx_split = 8, + .rx_last = 11, + .tx_shift = 2, + .type = AT91_DEVTYPE_SAM9263, +}; + +static const struct at91_devtype_data at91_at91sam9x5_data = { + .rx_first = 0, + .rx_split = 4, + .rx_last = 5, + .tx_shift = 1, + .type = AT91_DEVTYPE_SAM9X5, }; static const struct can_bittiming_const at91_bittiming_const = { @@ -1249,10 +1251,42 @@ static struct attribute_group at91_sysfs_attr_group = { .attrs = at91_sysfs_attrs, }; +#if defined(CONFIG_OF) +static const struct of_device_id at91_can_dt_ids[] = { + { + .compatible = "atmel,at91sam9x5-can", + .data = &at91_at91sam9x5_data, + }, { + .compatible = "atmel,at91sam9263-can", + .data = &at91_at91sam9263_data, + }, { + /* sentinel */ + } +}; +MODULE_DEVICE_TABLE(of, at91_can_dt_ids); +#else +#define at91_can_dt_ids NULL +#endif + +static const struct at91_devtype_data *at91_can_get_driver_data(struct platform_device *pdev) +{ + if (pdev->dev.of_node) { + const struct of_device_id *match; + + match = of_match_node(at91_can_dt_ids, pdev->dev.of_node); + if (!match) { + dev_err(&pdev->dev, "no matching node found in dtb\n"); + return
NULL; + } + return (const struct at91_devtype_data *)match->data; + } + return (const struct at91_devtype_data *) + platform_get_device_id(pdev)->driver_data; +} + static int at91_can_probe(struct platform_device *pdev) { const struct at91_devtype_data *devtype_data; - enum at91_devtype devtype; struct net_device *dev; struct at91_priv *priv; struct resource *res; @@ -1260,8 +1294,12 @@ static int at91_can_probe(struct platform_device *pdev) void __iomem *addr; int err, irq; - devtype = pdev->id_entry->driver_data; - devtype_data = &at91_devtype_data[devtype]; + devtype_data = at91_can_get_driver_data(pdev); + if (!devtype_data) { + dev_err(&pdev->dev, "no driver data\n"); + err = -ENODEV; + goto exit; + } clk = clk_get(&pdev->dev, "can_clk"); if (IS_ERR(clk)) { @@ -1310,7 +1348,6 @@ static int at91_can_probe(struct platform_device *pdev) priv->dev = dev; priv->reg_base = addr; priv->devtype_data = *devtype_data; - priv->devtype_data.type = devtype; priv->clk = clk; priv->pdata = pdev->dev.platform_data; priv->mb0_id = 0x7ff; @@ -1373,10 +1410,10 @@ static int at91_can_remove(struct platform_device *pdev) static const struct platform_device_id at91_can_id_table[] = { { .name = "at91_can", - .driver_data = AT91_DEVTYPE_SAM9263, + .driver_data = (kernel_ulong_t)&at91_at91sam9263_data, }, { .name = "at91sam9x5_can", - .driver_data = AT91_DEVTYPE_SAM9X5, + .driver_data = (kernel_ulong_t)&at91_at91sam9x5_data, }, { /* sentinel */ } @@ -1389,6 +1426,7 @@ static struct platform_driver at91_can_driver = { .driver = { .name = KBUILD_MODNAME, .owner = THIS_MODULE, + .of_match_table = at91_can_dt_ids, }, .id_table = at91_can_id_table, }; -- cgit v1.2.3
From 4eb06148250f92e1e58bf069c309dac173e8b5f7 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 29 Mar 2013 05:36:29 +0000 Subject: doc: packet: add minimal TPACKET_V3 example code Lost in space for a long time, but it finally came back to us from some ancient code tombs. This patch adds a minimal runnable example of Linux' packet mmap(2) from Chetan Loke's TPACKET_V3. Special thanks to David S. Miller, and also Eric Leblond and Victor Julien! Cc: Eric Leblond Cc: Victor Julien Signed-off-by: Daniel Borkmann Signed-off-by: David S. Miller --- Documentation/networking/packet_mmap.txt | 327 +++++++++++++++++++++++++++++++ 1 file changed, 327 insertions(+) (limited to 'Documentation')
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 94444b152fbc..65efb85e49de 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -684,6 +684,333 @@ int main(int argc, char **argp) return 0; } +------------------------------------------------------------------------------- ++ AF_PACKET TPACKET_V3 example +------------------------------------------------------------------------------- + +AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame +sizes by doing its own memory management. It is based on blocks where polling +works on a per block basis instead of per ring as in TPACKET_V2 and its predecessors. + +It is said that TPACKET_V3 brings the following benefits: + *) ~15 - 20% reduction in CPU-usage + *) ~20% increase in packet capture rate + *) ~2x increase in packet density + *) Port aggregation analysis + *) Non static frame size to capture entire packet payload + +So it seems to be a good candidate to be used with packet fanout.
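[As a rough, editor-supplied sketch of how such a socket could join a fanout group (this helper is not part of the original patch, and the group id 42 is an arbitrary example value):

	/* Hypothetical helper: add an already configured and bound
	 * TPACKET_V3 socket to a packet fanout group, so several such
	 * sockets can share the capture load by flow hash.
	 */
	static int join_fanout_group(int fd)
	{
		int fanout_id = 42;	/* arbitrary, identifies the group */
		int fanout_arg = fanout_id | (PACKET_FANOUT_HASH << 16);

		return setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
				  &fanout_arg, sizeof(fanout_arg));
	}

Calling this once per socket after setup_socket() below would be enough; all sockets using the same id end up in the same group.]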
+ +Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile +it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <string.h> +#include <assert.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <netdb.h> +#include <poll.h> +#include <unistd.h> +#include <signal.h> +#include <inttypes.h> +#include <sys/socket.h> +#include <sys/mman.h> +#include <linux/if_packet.h> +#include <linux/if_ether.h> +#include <linux/ip.h> + +#define BLOCK_SIZE (1 << 22) +#define FRAME_SIZE 2048 + +#define NUM_BLOCKS 64 +#define NUM_FRAMES ((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE) + +#define BLOCK_RETIRE_TOV_IN_MS 64 +#define BLOCK_PRIV_AREA_SZ 13 + +#define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1)) + +#define BLOCK_STATUS(x) ((x)->h1.block_status) +#define BLOCK_NUM_PKTS(x) ((x)->h1.num_pkts) +#define BLOCK_O2FP(x) ((x)->h1.offset_to_first_pkt) +#define BLOCK_LEN(x) ((x)->h1.blk_len) +#define BLOCK_SNUM(x) ((x)->h1.seq_num) +#define BLOCK_O2PRIV(x) ((x)->offset_to_priv) +#define BLOCK_PRIV(x) ((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x))) +#define BLOCK_HDR_LEN (ALIGN_8(sizeof(struct block_desc))) +#define BLOCK_PLUS_PRIV(sz_pri) (BLOCK_HDR_LEN + ALIGN_8((sz_pri))) + +#ifndef likely +# define likely(x) __builtin_expect(!!(x), 1) +#endif +#ifndef unlikely +# define unlikely(x) __builtin_expect(!!(x), 0) +#endif + +struct block_desc { + uint32_t version; + uint32_t offset_to_priv; + struct tpacket_hdr_v1 h1; +}; + +struct ring { + struct iovec *rd; + uint8_t *map; + struct tpacket_req3 req; +}; + +static unsigned long packets_total = 0, bytes_total = 0; +static sig_atomic_t sigint = 0; + +void sighandler(int num) +{ + sigint = 1; +} + +static int setup_socket(struct ring *ring, char *netdev) +{ + int err, i, fd, v = TPACKET_V3; + struct sockaddr_ll ll; + + fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (fd < 0) { + perror("socket"); + exit(1); + } + + err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + memset(&ring->req, 0, sizeof(ring->req)); + ring->req.tp_block_size = BLOCK_SIZE; + ring->req.tp_frame_size = FRAME_SIZE; + ring->req.tp_block_nr = NUM_BLOCKS; + ring->req.tp_frame_nr = NUM_FRAMES; + ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS; + ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ; + ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH; + + err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, + sizeof(ring->req)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, + fd, 0); + if (ring->map == MAP_FAILED) { + perror("mmap"); + exit(1); + } + + ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); + assert(ring->rd); + for (i = 0; i < ring->req.tp_block_nr; ++i) { + ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); + ring->rd[i].iov_len = ring->req.tp_block_size; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = PF_PACKET; + ll.sll_protocol = htons(ETH_P_ALL); + ll.sll_ifindex = if_nametoindex(netdev); + ll.sll_hatype = 0; + ll.sll_pkttype = 0; + ll.sll_halen = 0; + + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + exit(1); + } + + return fd; +} + +#ifdef __checked +static uint64_t prev_block_seq_num = 0; + +void assert_block_seq_num(struct block_desc *pbd) +{ + if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) { + printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != " + "actual seq:%"PRIu64"\n", prev_block_seq_num, + prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd)); + exit(1); + }
+ + prev_block_seq_num = BLOCK_SNUM(pbd); +} + +static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) +{ + if (BLOCK_NUM_PKTS(pbd)) { + if (unlikely(bytes != BLOCK_LEN(pbd))) { + printf("block:%u with %upackets, expected len:%u != actual len:%u\n", + block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd)); + exit(1); + } + } else { + if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) { + printf("block:%u, expected len:%lu != actual len:%u\n", + block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd)); + exit(1); + } + } +} + +static void assert_block_header(struct block_desc *pbd, const int block_num) +{ + uint32_t block_status = BLOCK_STATUS(pbd); + + if (unlikely((block_status & TP_STATUS_USER) == 0)) { + printf("block:%u, not in TP_STATUS_USER\n", block_num); + exit(1); + } + + assert_block_seq_num(pbd); +} +#else +static inline void assert_block_header(struct block_desc *pbd, const int block_num) +{ +} +static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) +{ +} +#endif + +static void display(struct tpacket3_hdr *ppd) +{ + struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); + struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); + + if (eth->h_proto == htons(ETH_P_IP)) { + struct sockaddr_in ss, sd; + char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; + + memset(&ss, 0, sizeof(ss)); + ss.sin_family = PF_INET; + ss.sin_addr.s_addr = ip->saddr; + getnameinfo((struct sockaddr *) &ss, sizeof(ss), + sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); + + memset(&sd, 0, sizeof(sd)); + sd.sin_family = PF_INET; + sd.sin_addr.s_addr = ip->daddr; + getnameinfo((struct sockaddr *) &sd, sizeof(sd), + dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); + + printf("%s -> %s, ", sbuff, dbuff); + } + + printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); +} + +static void walk_block(struct block_desc *pbd, const int block_num) +{ + int num_pkts = BLOCK_NUM_PKTS(pbd), i; + unsigned long bytes = 0; + unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ); + struct tpacket3_hdr *ppd; + + assert_block_header(pbd, block_num); + + ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd)); + for (i = 0; i < num_pkts; ++i) { + bytes += ppd->tp_snaplen; + if (ppd->tp_next_offset) + bytes_with_padding += ppd->tp_next_offset; + else + bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac); + + display(ppd); + + ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset); + __sync_synchronize(); + } + + assert_block_len(pbd, bytes_with_padding, block_num); + + packets_total += num_pkts; + bytes_total += bytes; +} + +void flush_block(struct block_desc *pbd) +{ + BLOCK_STATUS(pbd) = TP_STATUS_KERNEL; + __sync_synchronize(); +} + +static void teardown_socket(struct ring *ring, int fd) +{ + munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); + free(ring->rd); + close(fd); +} + +int main(int argc, char **argp) +{ + int fd, err; + socklen_t len; + struct ring ring; + struct pollfd pfd; + unsigned int block_num = 0; + struct block_desc *pbd; + struct tpacket_stats_v3 stats; + + if (argc != 2) { + fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); + return EXIT_FAILURE; + } + + signal(SIGINT, sighandler); + + memset(&ring, 0, sizeof(ring)); + fd = setup_socket(&ring, argp[argc - 1]); + assert(fd > 0); + + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = fd; + pfd.events = POLLIN | POLLERR; + pfd.revents = 0; + + while (likely(!sigint)) { + pbd = (struct block_desc *) ring.rd[block_num].iov_base; 
+retry_block: + if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) { + poll(&pfd, 1, -1); + goto retry_block; + } + + walk_block(pbd, block_num); + flush_block(pbd); + block_num = (block_num + 1) % NUM_BLOCKS; + } + + len = sizeof(stats); + err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); + if (err < 0) { + perror("getsockopt"); + exit(1); + } + + fflush(stdout); + printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", + stats.tp_packets, bytes_total, stats.tp_drops, + stats.tp_freeze_q_cnt); + + teardown_socket(&ring, fd); + return 0; +} + ------------------------------------------------------------------------------- + PACKET_TIMESTAMP ------------------------------------------------------------------------------- -- cgit v1.2.3 From 56aa091d60a63fee83d2c894edb69b7c159966c7 Mon Sep 17 00:00:00 2001 From: Werner Almesberger Date: Thu, 4 Apr 2013 06:32:35 +0000 Subject: ieee802154/nl-mac.c: make some MLME operations optional Check for NULL before calling the following operations from "struct ieee802154_mlme_ops": assoc_req, assoc_resp, disassoc_req, start_req, and scan_req. This fixes a current oops where those functions are called but not implemented. It also updates the documentation to clarify that they are now optional by design. If a call to an unimplemented function is attempted, the kernel returns EOPNOTSUPP via netlink. The following operations are still required: get_phy, get_pan_id, get_short_addr, and get_dsn. Note that the places where this patch changes the initialization of "ret" should not affect the rest of the code since "ret" was always set (again) before returning its value. Signed-off-by: Werner Almesberger Signed-off-by: David S. Miller --- Documentation/networking/ieee802154.txt | 5 +++-- include/net/ieee802154_netdev.h | 4 ++++ net/ieee802154/nl-mac.c | 25 ++++++++++++++++++++----- 3 files changed, 27 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt index 703cf4370c79..67a9cb259d40 100644 --- a/Documentation/networking/ieee802154.txt +++ b/Documentation/networking/ieee802154.txt @@ -71,8 +71,9 @@ submits skb to qdisc), so if you need something from that cb later, you should store info in the skb->data on your own. To hook the MLME interface you have to populate the ml_priv field of your -net_device with a pointer to struct ieee802154_mlme_ops instance. All fields are -required. +net_device with a pointer to struct ieee802154_mlme_ops instance. The fields +assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional. +All other fields are required. We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c diff --git a/include/net/ieee802154_netdev.h b/include/net/ieee802154_netdev.h index 642f94c0fa2f..8196d5d40359 100644 --- a/include/net/ieee802154_netdev.h +++ b/include/net/ieee802154_netdev.h @@ -85,6 +85,8 @@ struct wpan_phy; * Use wpan_wpy_put to put that reference. */ struct ieee802154_mlme_ops { + /* The following fields are optional (can be NULL). */ + int (*assoc_req)(struct net_device *dev, struct ieee802154_addr *addr, u8 channel, u8 page, u8 cap); @@ -101,6 +103,8 @@ struct ieee802154_mlme_ops { int (*scan_req)(struct net_device *dev, u8 type, u32 channels, u8 page, u8 duration); + /* The fields below are required. 
*/ + struct wpan_phy *(*get_phy)(const struct net_device *dev); /* diff --git a/net/ieee802154/nl-mac.c b/net/ieee802154/nl-mac.c index 96bb08abece2..b0bdd8c51e9c 100644 --- a/net/ieee802154/nl-mac.c +++ b/net/ieee802154/nl-mac.c @@ -315,7 +315,7 @@ static int ieee802154_associate_req(struct sk_buff *skb, struct net_device *dev; struct ieee802154_addr addr; u8 page; - int ret = -EINVAL; + int ret = -EOPNOTSUPP; if (!info->attrs[IEEE802154_ATTR_CHANNEL] || !info->attrs[IEEE802154_ATTR_COORD_PAN_ID] || @@ -327,6 +327,8 @@ static int ieee802154_associate_req(struct sk_buff *skb, dev = ieee802154_nl_get_dev(info); if (!dev) return -ENODEV; + if (!ieee802154_mlme_ops(dev)->assoc_req) + goto out; if (info->attrs[IEEE802154_ATTR_COORD_HW_ADDR]) { addr.addr_type = IEEE802154_ADDR_LONG; @@ -350,6 +352,7 @@ static int ieee802154_associate_req(struct sk_buff *skb, page, nla_get_u8(info->attrs[IEEE802154_ATTR_CAPABILITY])); +out: dev_put(dev); return ret; } @@ -359,7 +362,7 @@ static int ieee802154_associate_resp(struct sk_buff *skb, { struct net_device *dev; struct ieee802154_addr addr; - int ret = -EINVAL; + int ret = -EOPNOTSUPP; if (!info->attrs[IEEE802154_ATTR_STATUS] || !info->attrs[IEEE802154_ATTR_DEST_HW_ADDR] || @@ -369,6 +372,8 @@ static int ieee802154_associate_resp(struct sk_buff *skb, dev = ieee802154_nl_get_dev(info); if (!dev) return -ENODEV; + if (!ieee802154_mlme_ops(dev)->assoc_resp) + goto out; addr.addr_type = IEEE802154_ADDR_LONG; nla_memcpy(addr.hwaddr, info->attrs[IEEE802154_ATTR_DEST_HW_ADDR], @@ -380,6 +385,7 @@ static int ieee802154_associate_resp(struct sk_buff *skb, nla_get_u16(info->attrs[IEEE802154_ATTR_DEST_SHORT_ADDR]), nla_get_u8(info->attrs[IEEE802154_ATTR_STATUS])); +out: dev_put(dev); return ret; } @@ -389,7 +395,7 @@ static int ieee802154_disassociate_req(struct sk_buff *skb, { struct net_device *dev; struct ieee802154_addr addr; - int ret = -EINVAL; + int ret = -EOPNOTSUPP; if ((!info->attrs[IEEE802154_ATTR_DEST_HW_ADDR] && !info->attrs[IEEE802154_ATTR_DEST_SHORT_ADDR]) || @@ -399,6 +405,8 @@ static int ieee802154_disassociate_req(struct sk_buff *skb, dev = ieee802154_nl_get_dev(info); if (!dev) return -ENODEV; + if (!ieee802154_mlme_ops(dev)->disassoc_req) + goto out; if (info->attrs[IEEE802154_ATTR_DEST_HW_ADDR]) { addr.addr_type = IEEE802154_ADDR_LONG; @@ -415,6 +423,7 @@ static int ieee802154_disassociate_req(struct sk_buff *skb, ret = ieee802154_mlme_ops(dev)->disassoc_req(dev, &addr, nla_get_u8(info->attrs[IEEE802154_ATTR_REASON])); +out: dev_put(dev); return ret; } @@ -432,7 +441,7 @@ static int ieee802154_start_req(struct sk_buff *skb, struct genl_info *info) u8 channel, bcn_ord, sf_ord; u8 page; int pan_coord, blx, coord_realign; - int ret; + int ret = -EOPNOTSUPP; if (!info->attrs[IEEE802154_ATTR_COORD_PAN_ID] || !info->attrs[IEEE802154_ATTR_COORD_SHORT_ADDR] || @@ -448,6 +457,8 @@ static int ieee802154_start_req(struct sk_buff *skb, struct genl_info *info) dev = ieee802154_nl_get_dev(info); if (!dev) return -ENODEV; + if (!ieee802154_mlme_ops(dev)->start_req) + goto out; addr.addr_type = IEEE802154_ADDR_SHORT; addr.short_addr = nla_get_u16( @@ -476,6 +487,7 @@ static int ieee802154_start_req(struct sk_buff *skb, struct genl_info *info) ret = ieee802154_mlme_ops(dev)->start_req(dev, &addr, channel, page, bcn_ord, sf_ord, pan_coord, blx, coord_realign); +out: dev_put(dev); return ret; } @@ -483,7 +495,7 @@ static int ieee802154_start_req(struct sk_buff *skb, struct genl_info *info) static int ieee802154_scan_req(struct sk_buff *skb, struct genl_info 
*info) { struct net_device *dev; - int ret; + int ret = -EOPNOTSUPP; u8 type; u32 channels; u8 duration; @@ -497,6 +509,8 @@ static int ieee802154_scan_req(struct sk_buff *skb, struct genl_info *info) dev = ieee802154_nl_get_dev(info); if (!dev) return -ENODEV; + if (!ieee802154_mlme_ops(dev)->scan_req) + goto out; type = nla_get_u8(info->attrs[IEEE802154_ATTR_SCAN_TYPE]); channels = nla_get_u32(info->attrs[IEEE802154_ATTR_CHANNELS]); @@ -511,6 +525,7 @@ static int ieee802154_scan_req(struct sk_buff *skb, struct genl_info *info) ret = ieee802154_mlme_ops(dev)->scan_req(dev, type, channels, page, duration); +out: dev_put(dev); return ret; } -- cgit v1.2.3 From 49cfbf675cd18330cef3c4613e890b0044510e95 Mon Sep 17 00:00:00 2001 From: Giuseppe CAVALLARO Date: Mon, 8 Apr 2013 02:09:59 +0000 Subject: stmmac: review driver documentation This patch reviews the driver documentation file; for example, there were some new fields (in the driver module parameter section) and the ptp files were not documented. Signed-off-by: Giuseppe Cavallaro Signed-off-by: David S. Miller --- Documentation/networking/stmmac.txt | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index 8efe0b3c8b83..654d2e55c8cb 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -1,6 +1,6 @@ STMicroelectronics 10/100/1000 Synopsys Ethernet driver -Copyright (C) 2007-2010 STMicroelectronics Ltd +Copyright (C) 2007-2013 STMicroelectronics Ltd Author: Giuseppe Cavallaro This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers @@ -10,7 +10,7 @@ Currently this network device driver is for all STM embedded MAC/GMAC (i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 FF1152AMT0221 D1215994A VIRTEX FPGA board. -DWC Ether MAC 10/100/1000 Universal version 3.60a (and older) and DWC Ether +DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether MAC 10/100 Universal version 4.0 have been used for developing this driver. This driver supports both the platform bus and PCI. @@ -32,6 +32,8 @@ The kernel configuration option is STMMAC_ETH: watchdog: transmit timeout (in milliseconds); flow_ctrl: Flow control ability [on/off]; pause: Flow Control Pause Time; + eee_timer: tx EEE timer; + chain_mode: select chain mode instead of ring. 3) Command line options Driver parameters can be also passed in command line by using: @@ -164,12 +166,12 @@ Where: o bus_setup: perform HW setup of the bus. For example, on some ST platforms this field is used to configure the AMBA bridge to generate more efficient STBus traffic. - o init/exit: callbacks used for calling a custom initialisation; + o init/exit: callbacks used for calling a custom initialization; this is sometime necessary on some platforms (e.g. ST boxes) where the HW needs to have set some PIO lines or system cfg registers. o custom_cfg/custom_data: this is a custom configuration that can be passed - while initialising the resources. + while initializing the resources. o bsp_priv: another private poiter. For MDIO bus The we have: @@ -273,6 +275,8 @@ reset procedure etc). 
o norm_desc.c: functions for handling normal descriptors; o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; o mmc_core.c/mmc.h: Management MAC Counters; + o stmmac_hwtstamp.c: HW timestamp support for PTP + o stmmac_ptp.c: PTP 1588 clock 5) Debug Information -- cgit v1.2.3 From 077f02f1baee7c1eed800e95c7bb76a6be5226d4 Mon Sep 17 00:00:00 2001 From: Rami Rosen Date: Mon, 8 Apr 2013 04:04:20 +0000 Subject: Documentation: cgroup: add documentation for net_cls cgroups. This patch adds a new file, Documentation/cgroups/net_cls.txt, with info about net_cls cgroups, and updates the 00-INDEX accordingly. Signed-off-by: Rami Rosen Signed-off-by: David S. Miller --- Documentation/cgroups/00-INDEX | 2 ++ Documentation/cgroups/net_cls.txt | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+) create mode 100644 Documentation/cgroups/net_cls.txt (limited to 'Documentation') diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX index f5635a09c3f6..bc461b6425a7 100644 --- a/Documentation/cgroups/00-INDEX +++ b/Documentation/cgroups/00-INDEX @@ -18,6 +18,8 @@ memcg_test.txt - Memory Resource Controller; implementation details. memory.txt - Memory Resource Controller; design, accounting, interface, testing. +net_cls.txt + - Network classifier cgroups details and usages. net_prio.txt - Network priority cgroups details and usages. resource_counter.txt diff --git a/Documentation/cgroups/net_cls.txt b/Documentation/cgroups/net_cls.txt new file mode 100644 index 000000000000..9face6bb578a --- /dev/null +++ b/Documentation/cgroups/net_cls.txt @@ -0,0 +1,34 @@ +Network classifier cgroup +------------------------- + +The Network classifier cgroup provides an interface to +tag network packets with a class identifier (classid). + +The Traffic Controller (tc) can be used to assign +different priorities to packets from different cgroups. + +Creating a net_cls cgroups instance creates a net_cls.classid file. +This net_cls.classid value is initialized to 0. + +You can write hexadecimal values to net_cls.classid; the format for these +values is 0xAAAABBBB; AAAA is the major handle number and BBBB +is the minor handle number. +Reading net_cls.classid yields a decimal result. + +Example: +mkdir /sys/fs/cgroup/net_cls +mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls +mkdir /sys/fs/cgroup/net_cls/0 +echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid + - setting a 10:1 handle. + +cat /sys/fs/cgroup/net_cls/0/net_cls.classid +1048577 + +configuring tc: +tc qdisc add dev eth0 root handle 10: htb + +tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit + - creating traffic class 10:1 + +tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup -- cgit v1.2.3 From b8075daf552b37b668618f47d15ce2e2e7a1b258 Mon Sep 17 00:00:00 2001 From: Sebastian Hesselbarth Date: Wed, 10 Apr 2013 23:24:48 +0000 Subject: net: mvmdio: add clocks property to binding documentation Commit 3d604da1e9547c09c9dcc0ee443c306c9ae1a480 ("net: mvmdio: get and enable optional clock") was missing an update of the corresponding device tree binding documentation. This patch adds the clocks property to mvmdio binding documentation. Signed-off-by: Sebastian Hesselbarth Signed-off-by: David S. 
Miller --- Documentation/devicetree/bindings/net/marvell-orion-mdio.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt index 052b5f28a624..9417e54c26c0 100644 --- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt +++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt @@ -11,6 +11,7 @@ Required properties: Optional properties: - interrupts: interrupt line number for the SMI error/done interrupt +- clocks: Phandle to the clock control device and gate bit The child nodes of the MDIO driver are the individual PHY devices connected to this MDIO bus. They must have a "reg" property given the -- cgit v1.2.3 From 209224862cabf7a871d680c448148ef6376bf98b Mon Sep 17 00:00:00 2001 From: Sebastian Hesselbarth Date: Wed, 10 Apr 2013 23:29:33 +0000 Subject: net: mv643xx_eth: add shared clk and cleanup existing clk handling This patch adds an optional shared block clock to avoid lockups on clock gated controllers. Besides the new clock, clock handling for existing clocks is cleaned up and moved to devm_clk_get. Device tree binding documentation is updated for the new clocks property. Signed-off-by: Sebastian Hesselbarth Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/marvell.txt | 3 +++ drivers/net/ethernet/marvell/mv643xx_eth.c | 27 +++++++++++---------------- 2 files changed, 14 insertions(+), 16 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/marvell.txt b/Documentation/devicetree/bindings/marvell.txt index f1533d91953a..f7a0da6b4022 100644 --- a/Documentation/devicetree/bindings/marvell.txt +++ b/Documentation/devicetree/bindings/marvell.txt @@ -115,6 +115,9 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd. - compatible : "marvell,mv64360-eth-block" - reg : Offset and length of the register set for this block + Optional properties: + - clocks : Phandle to the clock control device and gate bit + Example Discovery Ethernet block node: ethernet-block@2000 { #address-cells = <1>; diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c index aedbd8256ad1..bbe61041ddac 100644 --- a/drivers/net/ethernet/marvell/mv643xx_eth.c +++ b/drivers/net/ethernet/marvell/mv643xx_eth.c @@ -268,7 +268,7 @@ struct mv643xx_eth_shared_private { int extended_rx_coal_limit; int tx_bw_control; int tx_csum_limit; - + struct clk *clk; }; #define TX_BW_CONTROL_ABSENT 0 @@ -410,9 +410,7 @@ struct mv643xx_eth_private { /* * Hardware-specific parameters. */ -#if defined(CONFIG_HAVE_CLK) struct clk *clk; -#endif unsigned int t_clk; }; @@ -2569,6 +2567,10 @@ static int mv643xx_eth_shared_probe(struct platform_device *pdev) if (msp->base == NULL) goto out_free; + msp->clk = devm_clk_get(&pdev->dev, NULL); + if (!IS_ERR(msp->clk)) + clk_prepare_enable(msp->clk); + /* * (Re-)program MBUS remapping windows if we are asked to. */ @@ -2595,6 +2597,8 @@ static int mv643xx_eth_shared_remove(struct platform_device *pdev) struct mv643xx_eth_shared_private *msp = platform_get_drvdata(pdev); iounmap(msp->base); + if (!IS_ERR(msp->clk)) + clk_disable_unprepare(msp->clk); kfree(msp); return 0; @@ -2801,13 +2805,12 @@ static int mv643xx_eth_probe(struct platform_device *pdev) * it to override the default. */ mp->t_clk = 133000000; -#if defined(CONFIG_HAVE_CLK) - mp->clk = clk_get(&pdev->dev, (pdev->id ? 
"1" : "0")); + mp->clk = devm_clk_get(&pdev->dev, NULL); if (!IS_ERR(mp->clk)) { clk_prepare_enable(mp->clk); mp->t_clk = clk_get_rate(mp->clk); } -#endif + set_params(mp, pd); netif_set_real_num_tx_queues(dev, mp->txq_count); netif_set_real_num_rx_queues(dev, mp->rxq_count); @@ -2889,12 +2892,8 @@ static int mv643xx_eth_probe(struct platform_device *pdev) return 0; out: -#if defined(CONFIG_HAVE_CLK) - if (!IS_ERR(mp->clk)) { + if (!IS_ERR(mp->clk)) clk_disable_unprepare(mp->clk); - clk_put(mp->clk); - } -#endif free_netdev(dev); return err; @@ -2909,12 +2908,8 @@ static int mv643xx_eth_remove(struct platform_device *pdev) phy_detach(mp->phy); cancel_work_sync(&mp->tx_timeout_task); -#if defined(CONFIG_HAVE_CLK) - if (!IS_ERR(mp->clk)) { + if (!IS_ERR(mp->clk)) clk_disable_unprepare(mp->clk); - clk_put(mp->clk); - } -#endif free_netdev(mp->dev); -- cgit v1.2.3 From 5683264c3981047aa93eebabcdbb81676018a7c9 Mon Sep 17 00:00:00 2001 From: Patrick McHardy Date: Wed, 17 Apr 2013 06:47:07 +0000 Subject: netlink: add documentation for memory mapped I/O Signed-off-by: Patrick McHardy Signed-off-by: David S. Miller --- Documentation/networking/netlink_mmap.txt | 339 ++++++++++++++++++++++++++++++ 1 file changed, 339 insertions(+) create mode 100644 Documentation/networking/netlink_mmap.txt (limited to 'Documentation') diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt new file mode 100644 index 000000000000..1c2dab409625 --- /dev/null +++ b/Documentation/networking/netlink_mmap.txt @@ -0,0 +1,339 @@ +This file documents how to use memory mapped I/O with netlink. + +Author: Patrick McHardy + +Overview +-------- + +Memory mapped netlink I/O can be used to increase throughput and decrease +overhead of unicast receive and transmit operations. Some netlink subsystems +require high throughput, these are mainly the netfilter subsystems +nfnetlink_queue and nfnetlink_log, but it can also help speed up large +dump operations of f.i. the routing database. + +Memory mapped netlink I/O used two circular ring buffers for RX and TX which +are mapped into the processes address space. + +The RX ring is used by the kernel to directly construct netlink messages into +user-space memory without copying them as done with regular socket I/O, +additionally as long as the ring contains messages no recvmsg() or poll() +syscalls have to be issued by user-space to get more message. + +The TX ring is used to process messages directly from user-space memory, the +kernel processes all messages contained in the ring using a single sendmsg() +call. + +Usage overview +-------------- + +In order to use memory mapped netlink I/O, user-space needs three main changes: + +- ring setup +- conversion of the RX path to get messages from the ring instead of recvmsg() +- conversion of the TX path to construct messages into the ring + +Ring setup is done using setsockopt() to provide the ring parameters to the +kernel, then a call to mmap() to map the ring into the processes address space: + +- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, ¶ms, sizeof(params)); +- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, ¶ms, sizeof(params)); +- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) + +Usage of either ring is optional, but even if only the RX ring is used the +mapping still needs to be writable in order to update the frame status after +processing. 
+
+Conversion of the reception path involves calling poll() on the file
+descriptor; once the socket is readable, the frames from the ring are
+processed in order until no more messages are available, as indicated by
+a status word in the frame header.
+
+On the kernel side, in order to make use of memory mapped I/O on receive, the
+originating netlink subsystem needs to support memory mapped I/O; otherwise
+it will use an allocated socket buffer as usual and the contents will be
+copied to the ring on transmission, nullifying most of the performance gains.
+Dumps of kernel databases automatically support memory mapped I/O.
+
+Conversion of the transmit path involves changing message construction to
+use memory from the TX ring instead of (usually) a buffer declared on the
+stack and setting up the frame header appropriately. Optionally poll() can
+be used to wait for free frames in the TX ring.
+
+Structures and definitions for using memory mapped I/O are contained in
+<linux/netlink.h>.
+
+RX and TX rings
+---------------
+
+Each ring contains a number of contiguous memory blocks, containing frames of
+fixed size dependent on the parameters used for ring setup.
+
+Ring:	[ block 0 ]
+		[ frame 0 ]
+		[ frame 1 ]
+	[ block 1 ]
+		[ frame 2 ]
+		[ frame 3 ]
+	...
+	[ block n ]
+		[ frame 2 * n ]
+		[ frame 2 * n + 1 ]
+
+The blocks are only visible to the kernel; from the point of view of
+user-space, the ring just contains the frames in a contiguous memory zone.
+
+The ring parameters used for setting up the ring are defined as follows:
+
+struct nl_mmap_req {
+	unsigned int	nm_block_size;
+	unsigned int	nm_block_nr;
+	unsigned int	nm_frame_size;
+	unsigned int	nm_frame_nr;
+};
+
+Frames are grouped into blocks, where each block is a contiguous region of
+memory and holds nm_block_size / nm_frame_size frames. The total number of
+frames in the ring is nm_frame_nr. The following invariants hold:
+
+- frames_per_block = nm_block_size / nm_frame_size
+
+- nm_frame_nr = frames_per_block * nm_block_nr
+
+Some parameters are constrained, specifically:
+
+- nm_block_size must be a multiple of the architecture's memory page size.
+  The getpagesize() function can be used to get the page size.
+
+- nm_frame_size must be greater than or equal to NL_MMAP_HDRLEN; in other
+  words, a frame must be able to hold at least the frame header.
+
+- nm_frame_size must be less than or equal to nm_block_size.
+
+- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT.
+
+- nm_frame_nr must equal the actual number of frames as specified above.
+
+When the kernel can't allocate physically contiguous memory for a ring block,
+it will fall back to use physically discontiguous memory. This might affect
+performance negatively; in order to avoid this, the nm_block_size parameter
+should be chosen to be as small as possible for the required frame size and
+the number of blocks should be increased instead.
+
+Ring frames
+-----------
+
+Each frame contains a frame header, consisting of a synchronization word and
+some meta-data, and the message itself.
+
+Frame:	[ header message ]
+
+The frame header is defined as follows:
+
+struct nl_mmap_hdr {
+	unsigned int	nm_status;
+	unsigned int	nm_len;
+	__u32		nm_group;
+	/* credentials */
+	__u32		nm_pid;
+	__u32		nm_uid;
+	__u32		nm_gid;
+};
+
+- nm_status is used for synchronizing processing between the kernel and user-
+  space and specifies ownership of the frame as well as the operation to perform
+
+- nm_len contains the length of the message contained in the data area
+
+- nm_group specifies the destination multicast group of the message
+
+- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending
+  process. These values correspond to the data available using SOCK_PASSCRED in
+  the SCM_CREDENTIALS cmsg.
+
+The possible values in the status word are:
+
+- NL_MMAP_STATUS_UNUSED:
+	RX ring:	frame belongs to the kernel and contains no message
+			for user-space. Appropriate action is to invoke poll()
+			to wait for new messages.
+
+	TX ring:	frame belongs to user-space and can be used for
+			message construction.
+
+- NL_MMAP_STATUS_RESERVED:
+	RX ring only:	frame is currently used by the kernel for message
+			construction and contains no valid message yet.
+			Appropriate action is to invoke poll() to wait for
+			new messages.
+
+- NL_MMAP_STATUS_VALID:
+	RX ring:	frame contains a valid message. Appropriate action is
+			to process the message and release the frame back to
+			the kernel by setting the status to
+			NL_MMAP_STATUS_UNUSED or queue the frame by setting the
+			status to NL_MMAP_STATUS_SKIP.
+
+	TX ring:	the frame contains a valid message from user-space to
+			be processed by the kernel. After completing processing
+			the kernel will release the frame back to user-space by
+			setting the status to NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_COPY:
+	RX ring only:	a message is ready to be processed but could not be
+			stored in the ring, either because it exceeded the
+			frame size or because the originating subsystem does
+			not support memory mapped I/O. Appropriate action is
+			to invoke recvmsg() to receive the message and release
+			the frame back to the kernel by setting the status to
+			NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_SKIP:
+	RX ring only:	user-space queued the message for later processing, but
+			processed some messages following it in the ring. The
+			kernel should skip this frame when looking for unused
+			frames.
+
+The data area of a frame begins at an offset of NL_MMAP_HDRLEN relative to the
+frame header.
+
+TX limitations
+--------------
+
+Kernel processing usually involves validation of the message received by
+user-space, then processing its contents. The kernel must ensure that
+user-space is not able to modify the message contents after they have been
+validated. In order to do so, the message is copied from the ring frame
+to an allocated buffer if either of these conditions is false:
+
+- only a single mapping of the ring exists
+- the file descriptor is not shared between processes
+
+This means that for threaded programs, the kernel will fall back to copying.
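+
+To illustrate the NL_MMAP_STATUS_SKIP handshake described above, a consumer
+that defers an expensive message while continuing to drain the ring could be
+sketched as follows (needs_deferred_processing() is a hypothetical
+application-side predicate; rx_ring and frame_offset are used as in the
+example below):
+
+	struct nl_mmap_hdr *hdr = rx_ring + frame_offset;
+
+	if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
+		if (needs_deferred_processing(hdr)) {
+			/* Keep the frame for later; the kernel will skip it
+			 * when looking for unused frames.
+			 */
+			hdr->nm_status = NL_MMAP_STATUS_SKIP;
+		} else {
+			process_msg((void *)hdr + NL_MMAP_HDRLEN);
+			/* Release the frame back to the kernel */
+			hdr->nm_status = NL_MMAP_STATUS_UNUSED;
+		}
+	}
+
+Once the deferred message has been processed, its frame is released the same
+way, by setting the status back to NL_MMAP_STATUS_UNUSED.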
+
+Example
+-------
+
+Ring setup:
+
+	unsigned int block_size = 16 * getpagesize();
+	struct nl_mmap_req req = {
+		.nm_block_size	= block_size,
+		.nm_block_nr	= 64,
+		.nm_frame_size	= 16384,
+		.nm_frame_nr	= 64 * block_size / 16384,
+	};
+	unsigned int ring_size;
+	void *rx_ring, *tx_ring;
+
+	/* Configure ring parameters */
+	if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
+		exit(1);
+	if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
+		exit(1);
+
+	/* Calculate size of each individual ring */
+	ring_size = req.nm_block_nr * req.nm_block_size;
+
+	/* Map RX/TX rings. The TX ring is located after the RX ring */
+	rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
+		       MAP_SHARED, fd, 0);
+	if ((long)rx_ring == -1L)
+		exit(1);
+	tx_ring = rx_ring + ring_size;
+
+Message reception:
+
+This example assumes some ring parameters of the ring setup are available.
+
+	unsigned int frame_offset = 0;
+	struct nl_mmap_hdr *hdr;
+	struct nlmsghdr *nlh;
+	unsigned char buf[16384];
+	ssize_t len;
+
+	while (1) {
+		struct pollfd pfds[1];
+
+		pfds[0].fd = fd;
+		pfds[0].events = POLLIN | POLLERR;
+		pfds[0].revents = 0;
+
+		if (poll(pfds, 1, -1) < 0 && errno != EINTR)
+			exit(1);
+
+		/* Check for errors. Error handling omitted */
+		if (pfds[0].revents & POLLERR)
+			<handle error>
+
+		/* If no new messages, poll again */
+		if (!(pfds[0].revents & POLLIN))
+			continue;
+
+		/* Process all frames */
+		while (1) {
+			/* Get next frame header */
+			hdr = rx_ring + frame_offset;
+
+			if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
+				/* Regular memory mapped frame */
+				nlh = (void *)hdr + NL_MMAP_HDRLEN;
+				len = hdr->nm_len;
+
+				/* Release empty message immediately. May happen
+				 * on error during message construction.
+				 */
+				if (len == 0)
+					goto release;
+			} else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
+				/* Frame queued to socket receive queue */
+				len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
+				if (len <= 0)
+					break;
+				nlh = (struct nlmsghdr *)buf;
+			} else
+				/* No more messages to process, continue polling */
+				break;
+
+			process_msg(nlh);
+release:
+			/* Release frame back to the kernel */
+			hdr->nm_status = NL_MMAP_STATUS_UNUSED;
+
+			/* Advance frame offset to next frame */
+			frame_offset = (frame_offset + frame_size) % ring_size;
+		}
+	}
+
+Message transmission:
+
+This example assumes some ring parameters of the ring setup are available.
+A single message is constructed and transmitted; to send multiple messages
+at once, they would be constructed in consecutive frames before a final call
+to sendto().
+
+	unsigned int frame_offset = 0;
+	struct nl_mmap_hdr *hdr;
+	struct nlmsghdr *nlh;
+	struct sockaddr_nl addr = {
+		.nl_family	= AF_NETLINK,
+	};
+
+	hdr = tx_ring + frame_offset;
+	if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
+		/* No frame available. Use poll() to avoid. */
+		exit(1);
+
+	nlh = (void *)hdr + NL_MMAP_HDRLEN;
+
+	/* Build message */
+	build_message(nlh);
+
+	/* Fill frame header: length and status need to be set */
+	hdr->nm_len	= nlh->nlmsg_len;
+	hdr->nm_status	= NL_MMAP_STATUS_VALID;
+
+	if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0)
+		exit(1);
+
+	/* Advance frame offset to next frame */
+	frame_offset = (frame_offset + frame_size) % ring_size;
-- 
cgit v1.2.3


From 2940b26bec9fe5bf183c994678e62b55d35717e6 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann
Date: Tue, 23 Apr 2013 00:39:32 +0000
Subject: packet: doc: update timestamping part

Bring the timestamping section in sync with the implementation.

Signed-off-by: Daniel Borkmann
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller
---
 Documentation/networking/packet_mmap.txt | 41 +++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 6 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 65efb85e49de..23dd80e82b8e 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -1016,10 +1016,11 @@ retry_block:
 -------------------------------------------------------------------------------

 The PACKET_TIMESTAMP setting determines the source of the timestamp in
-the packet meta information. If your NIC is capable of timestamping
-packets in hardware, you can request those hardware timestamps to used.
-Note: you may need to enable the generation of hardware timestamps with
-SIOCSHWTSTAMP.
+the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
+NIC is capable of timestamping packets in hardware, you can request those
+hardware timestamps to be used. Note: you may need to enable the generation
+of hardware timestamps with SIOCSHWTSTAMP (see related information in
+Documentation/networking/timestamping.txt).

 PACKET_TIMESTAMP accepts the same integer bit field as
 SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE
@@ -1031,8 +1032,36 @@ SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.

     req |= SOF_TIMESTAMPING_SYS_HARDWARE;
     setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))

-If PACKET_TIMESTAMP is not set, a software timestamp generated inside
-the networking stack is used (the behavior before this setting was added).
+For the mmap(2)ed ring buffers, such timestamps are stored in the
+tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
+what kind of timestamp has been reported, the tp_status field is bitwise
+OR'ed with the following possible bits ...
+
+	TP_STATUS_TS_SYS_HARDWARE
+	TP_STATUS_TS_RAW_HARDWARE
+	TP_STATUS_TS_SOFTWARE
+
+... that are equivalent to their SOF_TIMESTAMPING_* counterparts. For the
+RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set),
+then this means that a software fallback was invoked *within* PF_PACKET's
+processing code (less precise).
+
+Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
+ii) call sendto(), e.g. in blocking mode, iii) wait for the status of the
+relevant frames to be updated, i.e. for the frames to be handed back to the
+application, iv) walk through the frames to pick up the individual hw/sw
+timestamps.
+
+Only if transmit timestamping is enabled are these bits OR'ed together with
+TP_STATUS_AVAILABLE, so you must check for that in your application: e.g.
+test !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) in a first
+step to see if the frame belongs to the application, and then extract the
+type of timestamp from tp_status in a second step.
+
+If you do not care about timestamps and have them disabled, checking for
+TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the
+TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
+members do not contain a valid value. For TX_RINGs, by default no timestamp
+is generated!

 See include/linux/net_tstamp.h and Documentation/networking/timestamping
 for more information on hardware timestamps.
-- 
cgit v1.2.3
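A sketch of steps iii) and iv) above for a TPACKET_V2 TX_RING might look as
follows (assuming transmit timestamping has been enabled via PACKET_TIMESTAMP,
and that tx_ring and frame_offset address the mapped ring as in the
packet_mmap examples; illustrative fragment only):

	struct tpacket2_hdr *hdr = tx_ring + frame_offset;

	/* The frame has been handed back to the application once it is
	 * neither pending nor in flight.
	 */
	if (!(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))) {
		if (hdr->tp_status & (TP_STATUS_TS_RAW_HARDWARE |
				      TP_STATUS_TS_SYS_HARDWARE))
			printf("hw ts %u.%09u\n", hdr->tp_sec, hdr->tp_nsec);
		else if (hdr->tp_status & TP_STATUS_TS_SOFTWARE)
			printf("sw ts %u.%09u\n", hdr->tp_sec, hdr->tp_nsec);
		/* otherwise only TP_STATUS_AVAILABLE is set and tp_sec and
		 * tp_nsec do not contain a valid value
		 */
	}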