diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2014-06-12 14:27:40 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2014-06-12 14:27:40 -0700 |
commit | f9da455b93f6ba076935b4ef4589f61e529ae046 (patch) | |
tree | 3c4e69ce1ba1d6bf65915b97a76ca2172105b278 /Documentation | |
parent | 0e04c641b199435f3779454055f6a7de258ecdfc (diff) | |
parent | e5eca6d41f53db48edd8cf88a3f59d2c30227f8e (diff) | |
download | lwn-f9da455b93f6ba076935b4ef4589f61e529ae046.tar.gz lwn-f9da455b93f6ba076935b4ef4589f61e529ae046.zip |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
Diffstat (limited to 'Documentation')
27 files changed, 1646 insertions, 108 deletions
diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net index d922060e455d..416c5d59f52e 100644 --- a/Documentation/ABI/testing/sysfs-class-net +++ b/Documentation/ABI/testing/sysfs-class-net @@ -169,6 +169,14 @@ Description: "unknown", "notpresent", "down", "lowerlayerdown", "testing", "dormant", "up". +What: /sys/class/net/<iface>/phys_port_id +Date: July 2013 +KernelVersion: 3.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the interface unique physical port identifier within + the NIC, as a string. + What: /sys/class/net/<iface>/speed Date: October 2009 KernelVersion: 2.6.33 diff --git a/Documentation/ABI/testing/sysfs-class-net-cdc_ncm b/Documentation/ABI/testing/sysfs-class-net-cdc_ncm new file mode 100644 index 000000000000..5cedf72df358 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-net-cdc_ncm @@ -0,0 +1,149 @@ +What: /sys/class/net/<iface>/cdc_ncm/min_tx_pkt +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + The driver will pad NCM Transfer Blocks (NTBs) longer + than this to tx_max, allowing the device to receive + tx_max sized frames with no terminating short + packet. NTBs shorter than this limit are transmitted + as-is, without any padding, and are terminated with a + short USB packet. + + Padding to tx_max allows the driver to transmit NTBs + back-to-back without any interleaving short USB + packets. This reduces the number of short packet + interrupts in the device, and represents a tradeoff + between USB bus bandwidth and device DMA optimization. + + Set to 0 to pad all frames. Set greater than tx_max to + disable all padding. + +What: /sys/class/net/<iface>/cdc_ncm/rx_max +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + The maximum NTB size for RX. Cannot exceed the + maximum value supported by the device. Must allow at + least one max sized datagram plus headers. + + The actual limits are device dependent. See + dwNtbInMaxSize. + + Note: Some devices will silently ignore changes to + this value, resulting in oversized NTBs and + corresponding framing errors. + +What: /sys/class/net/<iface>/cdc_ncm/tx_max +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + The maximum NTB size for TX. Cannot exceed the + maximum value supported by the device. Must allow at + least one max sized datagram plus headers. + + The actual limits are device dependent. See + dwNtbOutMaxSize. + +What: /sys/class/net/<iface>/cdc_ncm/tx_timer_usecs +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + Datagram aggregation timeout in µs. The driver will + wait up to 3 times this timeout for more datagrams to + aggregate before transmitting an NTB frame. + + Valid range: 5 to 4000000 + + Set to 0 to disable aggregation. + +The following read-only attributes all represent fields of the +structure defined in section 6.2.1 "GetNtbParameters" of "Universal +Serial Bus Communications Class Subclass Specifications for Network +Control Model Devices" (CDC NCM), Revision 1.0 (Errata 1), November +24, 2010 from USB Implementers Forum, Inc. The descriptions are +quoted from table 6-3 of CDC NCM: "NTB Parameter Structure". + +What: /sys/class/net/<iface>/cdc_ncm/bmNtbFormatsSupported +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + Bit 0: 16-bit NTB supported (set to 1) + Bit 1: 32-bit NTB supported + Bits 2 – 15: reserved (reset to zero; must be ignored by host) + +What: /sys/class/net/<iface>/cdc_ncm/dwNtbInMaxSize +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + IN NTB Maximum Size in bytes + +What: /sys/class/net/<iface>/cdc_ncm/wNdpInDivisor +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + Divisor used for IN NTB Datagram payload alignment + +What: /sys/class/net/<iface>/cdc_ncm/wNdpInPayloadRemainder +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + Remainder used to align input datagram payload within + the NTB: (Payload Offset) mod (wNdpInDivisor) = + wNdpInPayloadRemainder + +What: /sys/class/net/<iface>/cdc_ncm/wNdpInAlignment +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + NDP alignment modulus for NTBs on the IN pipe. Shall + be a power of 2, and shall be at least 4. + +What: /sys/class/net/<iface>/cdc_ncm/dwNtbOutMaxSize +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + OUT NTB Maximum Size + +What: /sys/class/net/<iface>/cdc_ncm/wNdpOutDivisor +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + OUT NTB Datagram alignment modulus + +What: /sys/class/net/<iface>/cdc_ncm/wNdpOutPayloadRemainder +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + Remainder used to align output datagram payload + offsets within the NTB: Padding, shall be transmitted + as zero by function, and ignored by host. (Payload + Offset) mod (wNdpOutDivisor) = wNdpOutPayloadRemainder + +What: /sys/class/net/<iface>/cdc_ncm/wNdpOutAlignment +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + NDP alignment modulus for use in NTBs on the OUT + pipe. Shall be a power of 2, and shall be at least 4. + +What: /sys/class/net/<iface>/cdc_ncm/wNtbOutMaxDatagrams +Date: May 2014 +KernelVersion: 3.16 +Contact: Bjørn Mork <bjorn@mork.no> +Description: + Maximum number of datagrams that the host may pack + into a single OUT NTB. Zero means that the device + imposes no limit. diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues new file mode 100644 index 000000000000..5e9aeb91d355 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-net-queues @@ -0,0 +1,79 @@ +What: /sys/class/<iface>/queues/rx-<queue>/rps_cpus +Date: March 2010 +KernelVersion: 2.6.35 +Contact: netdev@vger.kernel.org +Description: + Mask of the CPU(s) currently enabled to participate into the + Receive Packet Steering packet processing flow for this + network device queue. Possible values depend on the number + of available CPU(s) in the system. + +What: /sys/class/<iface>/queues/rx-<queue>/rps_flow_cnt +Date: April 2010 +KernelVersion: 2.6.35 +Contact: netdev@vger.kernel.org +Description: + Number of Receive Packet Steering flows being currently + processed by this particular network device receive queue. + +What: /sys/class/<iface>/queues/tx-<queue>/tx_timeout +Date: November 2011 +KernelVersion: 3.3 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of transmit timeout events seen by this + network interface transmit queue. + +What: /sys/class/<iface>/queues/tx-<queue>/xps_cpus +Date: November 2010 +KernelVersion: 2.6.38 +Contact: netdev@vger.kernel.org +Description: + Mask of the CPU(s) currently enabled to participate into the + Transmit Packet Steering packet processing flow for this + network device transmit queue. Possible vaules depend on the + number of available CPU(s) in the system. + +What: /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/hold_time +Date: November 2011 +KernelVersion: 3.3 +Contact: netdev@vger.kernel.org +Description: + Indicates the hold time in milliseconds to measure the slack + of this particular network device transmit queue. + Default value is 1000. + +What: /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/inflight +Date: November 2011 +KernelVersion: 3.3 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of bytes (objects) in flight on this + network device transmit queue. + +What: /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/limit +Date: November 2011 +KernelVersion: 3.3 +Contact: netdev@vger.kernel.org +Description: + Indicates the current limit of bytes allowed to be queued + on this network device transmit queue. This value is clamped + to be within the bounds defined by limit_max and limit_min. + +What: /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/limit_max +Date: November 2011 +KernelVersion: 3.3 +Contact: netdev@vger.kernel.org +Description: + Indicates the absolute maximum limit of bytes allowed to be + queued on this network device transmit queue. See + include/linux/dynamic_queue_limits.h for the default value. + +What: /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/limit_min +Date: November 2011 +KernelVersion: 3.3 +Contact: netdev@vger.kernel.org +Description: + Indicates the absolute minimum limit of bytes allowed to be + queued on this network device transmit queue. Default value is + 0. diff --git a/Documentation/ABI/testing/sysfs-class-net-statistics b/Documentation/ABI/testing/sysfs-class-net-statistics new file mode 100644 index 000000000000..397118de7b5e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-net-statistics @@ -0,0 +1,201 @@ +What: /sys/class/<iface>/statistics/collisions +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of collisions seen by this network device. + This value might not be relevant with all MAC layers. + +What: /sys/class/<iface>/statistics/multicast +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of multicast packets received by this + network device. + +What: /sys/class/<iface>/statistics/rx_bytes +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of bytes received by this network device. + See the network driver for the exact meaning of when this + value is incremented. + +What: /sys/class/<iface>/statistics/rx_compressed +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of compressed packets received by this + network device. This value might only be relevant for interfaces + that support packet compression (e.g: PPP). + +What: /sys/class/<iface>/statistics/rx_crc_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets received with a CRC (FCS) error + by this network device. Note that the specific meaning might + depend on the MAC layer used by the interface. + +What: /sys/class/<iface>/statistics/rx_dropped +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets received by the network device + but dropped, that are not forwarded to the upper layers for + packet processing. See the network driver for the exact + meaning of this value. + +What: /sys/class/<iface>/statistics/rx_fifo_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of receive FIFO errors seen by this + network device. See the network driver for the exact + meaning of this value. + +What: /sys/class/<iface>/statistics/rx_frame_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of received frames with error, such as + alignment errors. Note that the specific meaning depends on + on the MAC layer protocol used. See the network driver for + the exact meaning of this value. + +What: /sys/class/<iface>/statistics/rx_length_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of received error packet with a length + error, oversized or undersized. See the network driver for the + exact meaning of this value. + +What: /sys/class/<iface>/statistics/rx_missed_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of received packets that have been missed + due to lack of capacity in the receive side. See the network + driver for the exact meaning of this value. + +What: /sys/class/<iface>/statistics/rx_over_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of received packets that are oversized + compared to what the network device is configured to accept + (e.g: larger than MTU). See the network driver for the exact + meaning of this value. + +What: /sys/class/<iface>/statistics/rx_packets +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the total number of good packets received by this + network device. + +What: /sys/class/<iface>/statistics/tx_aborted_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets that have been aborted + during transmission by a network device (e.g: because of + a medium collision). See the network driver for the exact + meaning of this value. + +What: /sys/class/<iface>/statistics/tx_bytes +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of bytes transmitted by a network + device. See the network driver for the exact meaning of this + value, in particular whether this accounts for all successfully + transmitted packets or all packets that have been queued for + transmission. + +What: /sys/class/<iface>/statistics/tx_carrier_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets that could not be transmitted + because of carrier errors (e.g: physical link down). See the + network driver for the exact meaning of this value. + +What: /sys/class/<iface>/statistics/tx_compressed +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of transmitted compressed packets. Note + this might only be relevant for devices that support + compression (e.g: PPP). + +What: /sys/class/<iface>/statistics/tx_dropped +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets dropped during transmission. + See the driver for the exact reasons as to why the packets were + dropped. + +What: /sys/class/<iface>/statistics/tx_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets in error during transmission by + a network device. See the driver for the exact reasons as to + why the packets were dropped. + +What: /sys/class/<iface>/statistics/tx_fifo_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets having caused a transmit + FIFO error. See the driver for the exact reasons as to why the + packets were dropped. + +What: /sys/class/<iface>/statistics/tx_heartbeat_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets transmitted that have been + reported as heartbeat errors. See the driver for the exact + reasons as to why the packets were dropped. + +What: /sys/class/<iface>/statistics/tx_packets +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets transmitted by a network + device. See the driver for whether this reports the number of all + attempted or successful transmissions. + +What: /sys/class/<iface>/statistics/tx_window_errors +Date: April 2005 +KernelVersion: 2.6.12 +Contact: netdev@vger.kernel.org +Description: + Indicates the number of packets not successfully transmitted + due to a window collision. The specific meaning depends on the + MAC layer used. On Ethernet this is usually used to report + late collisions errors. diff --git a/Documentation/DocBook/80211.tmpl b/Documentation/DocBook/80211.tmpl index 044b76436e83..d9b9416c989f 100644 --- a/Documentation/DocBook/80211.tmpl +++ b/Documentation/DocBook/80211.tmpl @@ -100,6 +100,7 @@ !Finclude/net/cfg80211.h wdev_priv !Finclude/net/cfg80211.h ieee80211_iface_limit !Finclude/net/cfg80211.h ieee80211_iface_combination +!Finclude/net/cfg80211.h cfg80211_check_combinations </chapter> <chapter> <title>Actions and configuration</title> diff --git a/Documentation/devicetree/bindings/net/amd-xgbe-phy.txt b/Documentation/devicetree/bindings/net/amd-xgbe-phy.txt new file mode 100644 index 000000000000..d01ed63d3ebb --- /dev/null +++ b/Documentation/devicetree/bindings/net/amd-xgbe-phy.txt @@ -0,0 +1,17 @@ +* AMD 10GbE PHY driver (amd-xgbe-phy) + +Required properties: +- compatible: Should be "amd,xgbe-phy-seattle-v1a" and + "ethernet-phy-ieee802.3-c45" +- reg: Address and length of the register sets for the device + - SerDes Rx/Tx registers + - SerDes integration registers (1/2) + - SerDes integration registers (2/2) + +Example: + xgbe_phy@e1240800 { + compatible = "amd,xgbe-phy-seattle-v1a", "ethernet-phy-ieee802.3-c45"; + reg = <0 0xe1240800 0 0x00400>, + <0 0xe1250000 0 0x00060>, + <0 0xe1250080 0 0x00004>; + }; diff --git a/Documentation/devicetree/bindings/net/amd-xgbe.txt b/Documentation/devicetree/bindings/net/amd-xgbe.txt new file mode 100644 index 000000000000..ea0c7908a3b8 --- /dev/null +++ b/Documentation/devicetree/bindings/net/amd-xgbe.txt @@ -0,0 +1,34 @@ +* AMD 10GbE driver (amd-xgbe) + +Required properties: +- compatible: Should be "amd,xgbe-seattle-v1a" +- reg: Address and length of the register sets for the device + - MAC registers + - PCS registers +- interrupt-parent: Should be the phandle for the interrupt controller + that services interrupts for this device +- interrupts: Should contain the amd-xgbe interrupt +- clocks: Should be the DMA clock for the amd-xgbe device (used for + calculating the correct Rx interrupt watchdog timer value on a DMA + channel for coalescing) +- clock-names: Should be the name of the DMA clock, "dma_clk" +- phy-handle: See ethernet.txt file in the same directory +- phy-mode: See ethernet.txt file in the same directory + +Optional properties: +- mac-address: mac address to be assigned to the device. Can be overridden + by UEFI. + +Example: + xgbe@e0700000 { + compatible = "amd,xgbe-seattle-v1a"; + reg = <0 0xe0700000 0 0x80000>, + <0 0xe0780000 0 0x80000>; + interrupt-parent = <&gic>; + interrupts = <0 325 4>; + clocks = <&xgbe_clk>; + clock-names = "dma_clk"; + phy-handle = <&phy>; + phy-mode = "xgmii"; + mac-address = [ 02 a1 a2 a3 a4 a5 ]; + }; diff --git a/Documentation/devicetree/bindings/net/broadcom-bcmgenet.txt b/Documentation/devicetree/bindings/net/broadcom-bcmgenet.txt index f2febb94550e..451fef26b4df 100644 --- a/Documentation/devicetree/bindings/net/broadcom-bcmgenet.txt +++ b/Documentation/devicetree/bindings/net/broadcom-bcmgenet.txt @@ -24,7 +24,7 @@ Optional properties: - fixed-link: When the GENET interface is connected to a MoCA hardware block or when operating in a RGMII to RGMII type of connection, or when the MDIO bus is voluntarily disabled, this property should be used to describe the "fixed link". - See Documentation/devicetree/bindings/net/fsl-tsec-phy.txt for information on + See Documentation/devicetree/bindings/net/fixed-link.txt for information on the property specifics Required child nodes: diff --git a/Documentation/devicetree/bindings/net/broadcom-systemport.txt b/Documentation/devicetree/bindings/net/broadcom-systemport.txt new file mode 100644 index 000000000000..c183ea90d9bc --- /dev/null +++ b/Documentation/devicetree/bindings/net/broadcom-systemport.txt @@ -0,0 +1,29 @@ +* Broadcom BCM7xxx Ethernet Systemport Controller (SYSTEMPORT) + +Required properties: +- compatible: should be one of "brcm,systemport-v1.00" or "brcm,systemport" +- reg: address and length of the register set for the device. +- interrupts: interrupts for the device, first cell must be for the the rx + interrupts, and the second cell should be for the transmit queues +- local-mac-address: Ethernet MAC address (48 bits) of this adapter +- phy-mode: Should be a string describing the PHY interface to the + Ethernet switch/PHY, see Documentation/devicetree/bindings/net/ethernet.txt +- fixed-link: see Documentation/devicetree/bindings/net/fixed-link.txt for + the property specific details + +Optional properties: +- systemport,num-tier2-arb: number of tier 2 arbiters, an integer +- systemport,num-tier1-arb: number of tier 1 arbiters, an integer +- systemport,num-txq: number of HW transmit queues, an integer +- systemport,num-rxq: number of HW receive queues, an integer + +Example: +ethernet@f04a0000 { + compatible = "brcm,systemport-v1.00"; + reg = <0xf04a0000 0x4650>; + local-mac-address = [ 00 11 22 33 44 55 ]; + fixed-link = <0 1 1000 0 0>; + phy-mode = "gmii"; + interrupts = <0x0 0x16 0x0>, + <0x0 0x17 0x0>; +}; diff --git a/Documentation/devicetree/bindings/net/can/xilinx_can.txt b/Documentation/devicetree/bindings/net/can/xilinx_can.txt new file mode 100644 index 000000000000..fe38847d8e26 --- /dev/null +++ b/Documentation/devicetree/bindings/net/can/xilinx_can.txt @@ -0,0 +1,44 @@ +Xilinx Axi CAN/Zynq CANPS controller Device Tree Bindings +--------------------------------------------------------- + +Required properties: +- compatible : Should be "xlnx,zynq-can-1.0" for Zynq CAN + controllers and "xlnx,axi-can-1.00.a" for Axi CAN + controllers. +- reg : Physical base address and size of the Axi CAN/Zynq + CANPS registers map. +- interrupts : Property with a value describing the interrupt + number. +- interrupt-parent : Must be core interrupt controller +- clock-names : List of input clock names - "can_clk", "pclk" + (For CANPS), "can_clk" , "s_axi_aclk"(For AXI CAN) + (See clock bindings for details). +- clocks : Clock phandles (see clock bindings for details). +- tx-fifo-depth : Can Tx fifo depth. +- rx-fifo-depth : Can Rx fifo depth. + + +Example: + +For Zynq CANPS Dts file: + zynq_can_0: can@e0008000 { + compatible = "xlnx,zynq-can-1.0"; + clocks = <&clkc 19>, <&clkc 36>; + clock-names = "can_clk", "pclk"; + reg = <0xe0008000 0x1000>; + interrupts = <0 28 4>; + interrupt-parent = <&intc>; + tx-fifo-depth = <0x40>; + rx-fifo-depth = <0x40>; + }; +For Axi CAN Dts file: + axi_can_0: axi-can@40000000 { + compatible = "xlnx,axi-can-1.00.a"; + clocks = <&clkc 0>, <&clkc 1>; + clock-names = "can_clk","s_axi_aclk" ; + reg = <0x40000000 0x10000>; + interrupt-parent = <&intc>; + interrupts = <0 59 1>; + tx-fifo-depth = <0x40>; + rx-fifo-depth = <0x40>; + }; diff --git a/Documentation/devicetree/bindings/net/cpsw-phy-sel.txt b/Documentation/devicetree/bindings/net/cpsw-phy-sel.txt index 7ff57a119f81..764c0c79b43d 100644 --- a/Documentation/devicetree/bindings/net/cpsw-phy-sel.txt +++ b/Documentation/devicetree/bindings/net/cpsw-phy-sel.txt @@ -2,7 +2,9 @@ TI CPSW Phy mode Selection Device Tree Bindings ----------------------------------------------- Required properties: -- compatible : Should be "ti,am3352-cpsw-phy-sel" +- compatible : Should be "ti,am3352-cpsw-phy-sel" for am335x platform and + "ti,dra7xx-cpsw-phy-sel" for dra7xx platform + "ti,am43xx-cpsw-phy-sel" for am43xx platform - reg : physical base address and size of the cpsw registers map - reg-names : names of the register map given in "reg" node diff --git a/Documentation/devicetree/bindings/net/fixed-link.txt b/Documentation/devicetree/bindings/net/fixed-link.txt new file mode 100644 index 000000000000..82bf7e0f47b6 --- /dev/null +++ b/Documentation/devicetree/bindings/net/fixed-link.txt @@ -0,0 +1,42 @@ +Fixed link Device Tree binding +------------------------------ + +Some Ethernet MACs have a "fixed link", and are not connected to a +normal MDIO-managed PHY device. For those situations, a Device Tree +binding allows to describe a "fixed link". + +Such a fixed link situation is described by creating a 'fixed-link' +sub-node of the Ethernet MAC device node, with the following +properties: + +* 'speed' (integer, mandatory), to indicate the link speed. Accepted + values are 10, 100 and 1000 +* 'full-duplex' (boolean, optional), to indicate that full duplex is + used. When absent, half duplex is assumed. +* 'pause' (boolean, optional), to indicate that pause should be + enabled. +* 'asym-pause' (boolean, optional), to indicate that asym_pause should + be enabled. + +Old, deprecated 'fixed-link' binding: + +* A 'fixed-link' property in the Ethernet MAC node, with 5 cells, of the + form <a b c d e> with the following accepted values: + - a: emulated PHY ID, choose any but but unique to the all specified + fixed-links, from 0 to 31 + - b: duplex configuration: 0 for half duplex, 1 for full duplex + - c: link speed in Mbits/sec, accepted values are: 10, 100 and 1000 + - d: pause configuration: 0 for no pause, 1 for pause + - e: asymmetric pause configuration: 0 for no asymmetric pause, 1 for + asymmetric pause + +Example: + +ethernet@0 { + ... + fixed-link { + speed = <1000>; + full-duplex; + }; + ... +}; diff --git a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt index 737cdef4f903..be6ea8960f20 100644 --- a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt +++ b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt @@ -42,10 +42,7 @@ Properties: interrupt. For TSEC and eTSEC devices, the first interrupt is transmit, the second is receive, and the third is error. - phy-handle : See ethernet.txt file in the same directory. - - fixed-link : <a b c d e> where a is emulated phy id - choose any, - but unique to the all specified fixed-links, b is duplex - 0 half, - 1 full, c is link speed - d#10/d#100/d#1000, d is pause - 0 no - pause, 1 pause, e is asym_pause - 0 no asym_pause, 1 asym_pause. + - fixed-link : See fixed-link.txt in the same directory. - phy-connection-type : See ethernet.txt file in the same directory. This property is only really needed if the connection is of type "rgmii-id", as all other connection types are detected by hardware. diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt new file mode 100644 index 000000000000..75d398bb1fbb --- /dev/null +++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt @@ -0,0 +1,36 @@ +Hisilicon hix5hd2 gmac controller + +Required properties: +- compatible: should be "hisilicon,hix5hd2-gmac". +- reg: specifies base physical address(s) and size of the device registers. + The first region is the MAC register base and size. + The second region is external interface control register. +- interrupts: should contain the MAC interrupt. +- #address-cells: must be <1>. +- #size-cells: must be <0>. +- phy-mode: see ethernet.txt [1]. +- phy-handle: see ethernet.txt [1]. +- mac-address: see ethernet.txt [1]. +- clocks: clock phandle and specifier pair. + +- PHY subnode: inherits from phy binding [2] + +[1] Documentation/devicetree/bindings/net/ethernet.txt +[2] Documentation/devicetree/bindings/net/phy.txt + +Example: + gmac0: ethernet@f9840000 { + compatible = "hisilicon,hix5hd2-gmac"; + reg = <0xf9840000 0x1000>,<0xf984300c 0x4>; + interrupts = <0 71 4>; + #address-cells = <1>; + #size-cells = <0>; + phy-mode = "mii"; + phy-handle = <&phy2>; + mac-address = [00 00 00 00 00 00]; + clocks = <&clock HIX5HD2_MAC0_CLK>; + + phy2: ethernet-phy@2 { + reg = <2>; + }; + }; diff --git a/Documentation/devicetree/bindings/net/ieee802154/at86rf230.txt b/Documentation/devicetree/bindings/net/ieee802154/at86rf230.txt new file mode 100644 index 000000000000..d3bbdded4cbe --- /dev/null +++ b/Documentation/devicetree/bindings/net/ieee802154/at86rf230.txt @@ -0,0 +1,23 @@ +* AT86RF230 IEEE 802.15.4 * + +Required properties: + - compatible: should be "atmel,at86rf230", "atmel,at86rf231", + "atmel,at86rf233" or "atmel,at86rf212" + - spi-max-frequency: maximal bus speed, should be set to 7500000 depends + sync or async operation mode + - reg: the chipselect index + - interrupts: the interrupt generated by the device + +Optional properties: + - reset-gpio: GPIO spec for the rstn pin + - sleep-gpio: GPIO spec for the slp_tr pin + +Example: + + at86rf231@0 { + compatible = "atmel,at86rf231"; + spi-max-frequency = <7500000>; + reg = <0>; + interrupts = <19 1>; + interrupt-parent = <&gpio3>; + }; diff --git a/Documentation/devicetree/bindings/net/micrel-ks8851.txt b/Documentation/devicetree/bindings/net/micrel-ks8851.txt index d54d0cc79487..bbdf9a7359a2 100644 --- a/Documentation/devicetree/bindings/net/micrel-ks8851.txt +++ b/Documentation/devicetree/bindings/net/micrel-ks8851.txt @@ -1,9 +1,18 @@ -Micrel KS8851 Ethernet mac +Micrel KS8851 Ethernet mac (MLL) Required properties: -- compatible = "micrel,ks8851-ml" of parallel interface +- compatible = "micrel,ks8851-mll" of parallel interface - reg : 2 physical address and size of registers for data and command - interrupts : interrupt connection +Micrel KS8851 Ethernet mac (SPI) + +Required properties: +- compatible = "micrel,ks8851" or the deprecated "ks8851" +- reg : chip select number +- interrupts : interrupt connection + Optional properties: -- vdd-supply: supply for Ethernet mac +- vdd-supply: analog 3.3V supply for Ethernet mac +- vdd-io-supply: digital 1.8V IO supply for Ethernet mac +- reset-gpios: reset_n input pin diff --git a/Documentation/devicetree/bindings/net/micrel-ksz9021.txt b/Documentation/devicetree/bindings/net/micrel-ksz9021.txt deleted file mode 100644 index 997a63f1aea1..000000000000 --- a/Documentation/devicetree/bindings/net/micrel-ksz9021.txt +++ /dev/null @@ -1,49 +0,0 @@ -Micrel KSZ9021 Gigabit Ethernet PHY - -Some boards require special tuning values, particularly when it comes to -clock delays. You can specify clock delay values by adding -micrel-specific properties to an Ethernet OF device node. - -All skew control options are specified in picoseconds. The minimum -value is 0, and the maximum value is 3000. - -Optional properties: - - rxc-skew-ps : Skew control of RXC pad - - rxdv-skew-ps : Skew control of RX CTL pad - - txc-skew-ps : Skew control of TXC pad - - txen-skew-ps : Skew control of TX_CTL pad - - rxd0-skew-ps : Skew control of RX data 0 pad - - rxd1-skew-ps : Skew control of RX data 1 pad - - rxd2-skew-ps : Skew control of RX data 2 pad - - rxd3-skew-ps : Skew control of RX data 3 pad - - txd0-skew-ps : Skew control of TX data 0 pad - - txd1-skew-ps : Skew control of TX data 1 pad - - txd2-skew-ps : Skew control of TX data 2 pad - - txd3-skew-ps : Skew control of TX data 3 pad - -Examples: - - /* Attach to an Ethernet device with autodetected PHY */ - &enet { - rxc-skew-ps = <3000>; - rxdv-skew-ps = <0>; - txc-skew-ps = <3000>; - txen-skew-ps = <0>; - status = "okay"; - }; - - /* Attach to an explicitly-specified PHY */ - mdio { - phy0: ethernet-phy@0 { - rxc-skew-ps = <3000>; - rxdv-skew-ps = <0>; - txc-skew-ps = <3000>; - txen-skew-ps = <0>; - reg = <0>; - }; - }; - ethernet@70000 { - status = "okay"; - phy = <&phy0>; - phy-mode = "rgmii-id"; - }; diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt new file mode 100644 index 000000000000..692076fda0e5 --- /dev/null +++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt @@ -0,0 +1,83 @@ +Micrel KSZ9021/KSZ9031 Gigabit Ethernet PHY + +Some boards require special tuning values, particularly when it comes to +clock delays. You can specify clock delay values by adding +micrel-specific properties to an Ethernet OF device node. + +Note that these settings are applied after any phy-specific fixup from +phy_fixup_list (see phy_init_hw() from drivers/net/phy/phy_device.c), +and therefore may overwrite them. + +KSZ9021: + + All skew control options are specified in picoseconds. The minimum + value is 0, the maximum value is 3000, and it is incremented by 200ps + steps. + + Optional properties: + + - rxc-skew-ps : Skew control of RXC pad + - rxdv-skew-ps : Skew control of RX CTL pad + - txc-skew-ps : Skew control of TXC pad + - txen-skew-ps : Skew control of TX CTL pad + - rxd0-skew-ps : Skew control of RX data 0 pad + - rxd1-skew-ps : Skew control of RX data 1 pad + - rxd2-skew-ps : Skew control of RX data 2 pad + - rxd3-skew-ps : Skew control of RX data 3 pad + - txd0-skew-ps : Skew control of TX data 0 pad + - txd1-skew-ps : Skew control of TX data 1 pad + - txd2-skew-ps : Skew control of TX data 2 pad + - txd3-skew-ps : Skew control of TX data 3 pad + +KSZ9031: + + All skew control options are specified in picoseconds. The minimum + value is 0, and the maximum is property-dependent. The increment + step is 60ps. + + Optional properties: + + Maximum value of 1860: + + - rxc-skew-ps : Skew control of RX clock pad + - txc-skew-ps : Skew control of TX clock pad + + Maximum value of 900: + + - rxdv-skew-ps : Skew control of RX CTL pad + - txen-skew-ps : Skew control of TX CTL pad + - rxd0-skew-ps : Skew control of RX data 0 pad + - rxd1-skew-ps : Skew control of RX data 1 pad + - rxd2-skew-ps : Skew control of RX data 2 pad + - rxd3-skew-ps : Skew control of RX data 3 pad + - txd0-skew-ps : Skew control of TX data 0 pad + - txd1-skew-ps : Skew control of TX data 1 pad + - txd2-skew-ps : Skew control of TX data 2 pad + - txd3-skew-ps : Skew control of TX data 3 pad + +Examples: + + /* Attach to an Ethernet device with autodetected PHY */ + &enet { + rxc-skew-ps = <3000>; + rxdv-skew-ps = <0>; + txc-skew-ps = <3000>; + txen-skew-ps = <0>; + status = "okay"; + }; + + /* Attach to an explicitly-specified PHY */ + mdio { + phy0: ethernet-phy@0 { + rxc-skew-ps = <3000>; + rxdv-skew-ps = <0>; + txc-skew-ps = <3000>; + txen-skew-ps = <0>; + reg = <0>; + }; + }; + ethernet@70000 { + status = "okay"; + phy = <&phy0>; + phy-mode = "rgmii-id"; + }; diff --git a/Documentation/devicetree/bindings/net/nfc/pn544.txt b/Documentation/devicetree/bindings/net/nfc/pn544.txt new file mode 100644 index 000000000000..dab69f36167c --- /dev/null +++ b/Documentation/devicetree/bindings/net/nfc/pn544.txt @@ -0,0 +1,35 @@ +* NXP Semiconductors PN544 NFC Controller + +Required properties: +- compatible: Should be "nxp,pn544-i2c". +- clock-frequency: IC work frequency. +- reg: address on the bus +- interrupt-parent: phandle for the interrupt gpio controller +- interrupts: GPIO interrupt to which the chip is connected +- enable-gpios: Output GPIO pin used for enabling/disabling the PN544 +- firmware-gpios: Output GPIO pin used to enter firmware download mode + +Optional SoC Specific Properties: +- pinctrl-names: Contains only one value - "default". +- pintctrl-0: Specifies the pin control groups used for this controller. + +Example (for ARM-based BeagleBone with PN544 on I2C2): + +&i2c2 { + + status = "okay"; + + pn544: pn544@28 { + + compatible = "nxp,pn544-i2c"; + + reg = <0x28>; + clock-frequency = <400000>; + + interrupt-parent = <&gpio1>; + interrupts = <17 GPIO_ACTIVE_HIGH>; + + enable-gpios = <&gpio3 21 GPIO_ACTIVE_HIGH>; + firmware-gpios = <&gpio3 19 GPIO_ACTIVE_HIGH>; + }; +}; diff --git a/Documentation/devicetree/bindings/net/nfc/st21nfca.txt b/Documentation/devicetree/bindings/net/nfc/st21nfca.txt new file mode 100644 index 000000000000..e4faa2e8dfeb --- /dev/null +++ b/Documentation/devicetree/bindings/net/nfc/st21nfca.txt @@ -0,0 +1,33 @@ +* STMicroelectronics SAS. ST21NFCA NFC Controller + +Required properties: +- compatible: Should be "st,st21nfca_i2c". +- clock-frequency: I²C work frequency. +- reg: address on the bus +- interrupt-parent: phandle for the interrupt gpio controller +- interrupts: GPIO interrupt to which the chip is connected +- enable-gpios: Output GPIO pin used for enabling/disabling the ST21NFCA + +Optional SoC Specific Properties: +- pinctrl-names: Contains only one value - "default". +- pintctrl-0: Specifies the pin control groups used for this controller. + +Example (for ARM-based BeagleBoard xM with ST21NFCA on I2C2): + +&i2c2 { + + status = "okay"; + + st21nfca: st21nfca@1 { + + compatible = "st,st21nfca_i2c"; + + reg = <0x01>; + clock-frequency = <400000>; + + interrupt-parent = <&gpio5>; + interrupts = <2 IRQ_TYPE_LEVEL_LOW>; + + enable-gpios = <&gpio5 29 GPIO_ACTIVE_HIGH>; + }; +}; diff --git a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt index 8dd3ef7bc56b..1e436133685f 100644 --- a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt +++ b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt @@ -12,6 +12,7 @@ Required properties: Optional SoC Specific Properties: - pinctrl-names: Contains only one value - "default". - pintctrl-0: Specifies the pin control groups used for this controller. +- autosuspend-delay: Specify autosuspend delay in milliseconds. Example (for ARM-based BeagleBone with TRF7970A on SPI1): @@ -29,6 +30,7 @@ Example (for ARM-based BeagleBone with TRF7970A on SPI1): ti,enable-gpios = <&gpio2 2 GPIO_ACTIVE_LOW>, <&gpio2 5 GPIO_ACTIVE_LOW>; vin-supply = <&ldo3_reg>; + autosuspend-delay = <30000>; status = "okay"; }; }; diff --git a/Documentation/devicetree/bindings/net/via-rhine.txt b/Documentation/devicetree/bindings/net/via-rhine.txt new file mode 100644 index 000000000000..334eca2bf937 --- /dev/null +++ b/Documentation/devicetree/bindings/net/via-rhine.txt @@ -0,0 +1,17 @@ +* VIA Rhine 10/100 Network Controller + +Required properties: +- compatible : Should be "via,vt8500-rhine" for integrated + Rhine controllers found in VIA VT8500, WonderMedia WM8950 + and similar. These are listed as 1106:3106 rev. 0x84 on the + virtual PCI bus under vendor-provided kernels +- reg : Address and length of the io space +- interrupts : Should contain the controller interrupt line + +Examples: + +ethernet@d8004000 { + compatible = "via,vt8500-rhine"; + reg = <0xd8004000 0x100>; + interrupts = <10>; +}; diff --git a/Documentation/driver-model/devres.txt b/Documentation/driver-model/devres.txt index 89472558011e..1525e30483fd 100644 --- a/Documentation/driver-model/devres.txt +++ b/Documentation/driver-model/devres.txt @@ -318,3 +318,8 @@ GPIO devm_gpiod_get_optional() devm_gpiod_get_index_optional() devm_gpiod_put() + +MDIO + devm_mdiobus_alloc() + devm_mdiobus_alloc_size() + devm_mdiobus_free() diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index a383c00392d0..9c723ecd0025 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -585,13 +585,19 @@ mode balance-tlb or 5 Adaptive transmit load balancing: channel bonding that - does not require any special switch support. The - outgoing traffic is distributed according to the - current load (computed relative to the speed) on each - slave. Incoming traffic is received by the current - slave. If the receiving slave fails, another slave - takes over the MAC address of the failed receiving - slave. + does not require any special switch support. + + In tlb_dynamic_lb=1 mode; the outgoing traffic is + distributed according to the current load (computed + relative to the speed) on each slave. + + In tlb_dynamic_lb=0 mode; the load balancing based on + current load is disabled and the load is distributed + only using the hash distribution. + + Incoming traffic is received by the current slave. + If the receiving slave fails, another slave takes over + the MAC address of the failed receiving slave. Prerequisite: @@ -736,6 +742,28 @@ primary_reselect This option was added for bonding version 3.6.0. +tlb_dynamic_lb + + Specifies if dynamic shuffling of flows is enabled in tlb + mode. The value has no effect on any other modes. + + The default behavior of tlb mode is to shuffle active flows across + slaves based on the load in that interval. This gives nice lb + characteristics but can cause packet reordering. If re-ordering is + a concern use this variable to disable flow shuffling and rely on + load balancing provided solely by the hash distribution. + xmit-hash-policy can be used to select the appropriate hashing for + the setup. + + The sysfs entry can be used to change the setting per bond device + and the initial value is derived from the module parameter. The + sysfs entry is allowed to be changed only if the bond device is + down. + + The default value is "1" that enables flow shuffling while value "0" + disables it. This option was added in bonding driver 3.7.1 + + updelay Specifies the time, in milliseconds, to wait before enabling a @@ -769,7 +797,7 @@ use_carrier xmit_hash_policy Selects the transmit hash policy to use for slave selection in - balance-xor and 802.3ad modes. Possible values are: + balance-xor, 802.3ad, and tlb modes. Possible values are: layer2 diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt index 4f7ae5261364..2236d6dcb7da 100644 --- a/Documentation/networking/can.txt +++ b/Documentation/networking/can.txt @@ -469,6 +469,41 @@ solution for a couple of reasons: having this 'send only' use-case we may remove the receive list in the Kernel to save a little (really a very little!) CPU usage. + 4.1.1.1 CAN filter usage optimisation + + The CAN filters are processed in per-device filter lists at CAN frame + reception time. To reduce the number of checks that need to be performed + while walking through the filter lists the CAN core provides an optimized + filter handling when the filter subscription focusses on a single CAN ID. + + For the possible 2048 SFF CAN identifiers the identifier is used as an index + to access the corresponding subscription list without any further checks. + For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as + hash function to retrieve the EFF table index. + + To benefit from the optimized filters for single CAN identifiers the + CAN_SFF_MASK or CAN_EFF_MASK have to be set into can_filter.mask together + with set CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in the + can_filter.mask makes clear that it matters whether a SFF or EFF CAN ID is + subscribed. E.g. in the example from above + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = CAN_SFF_MASK; + + both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass. + + To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the + filter has to be defined in this way to benefit from the optimized filters: + + struct can_filter rfilter[2]; + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK); + rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG; + rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK); + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter)); + 4.1.2 RAW socket option CAN_RAW_ERR_FILTER As described in chapter 3.4 the CAN interface driver can generate so diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt new file mode 100644 index 000000000000..a15ea602aa52 --- /dev/null +++ b/Documentation/networking/cdc_mbim.txt @@ -0,0 +1,339 @@ + cdc_mbim - Driver for CDC MBIM Mobile Broadband modems + ======================================================== + +The cdc_mbim driver supports USB devices conforming to the "Universal +Serial Bus Communications Class Subclass Specification for Mobile +Broadband Interface Model" [1], which is a further development of +"Universal Serial Bus Communications Class Subclass Specifications for +Network Control Model Devices" [2] optimized for Mobile Broadband +devices, aka "3G/LTE modems". + + +Command Line Parameters +======================= + +The cdc_mbim driver has no parameters of its own. But the probing +behaviour for NCM 1.0 backwards compatible MBIM functions (an +"NCM/MBIM function" as defined in section 3.2 of [1]) is affected +by a cdc_ncm driver parameter: + +prefer_mbim +----------- +Type: Boolean +Valid Range: N/Y (0-1) +Default Value: Y (MBIM is preferred) + +This parameter sets the system policy for NCM/MBIM functions. Such +functions will be handled by either the cdc_ncm driver or the cdc_mbim +driver depending on the prefer_mbim setting. Setting prefer_mbim=N +makes the cdc_mbim driver ignore these functions and lets the cdc_ncm +driver handle them instead. + +The parameter is writable, and can be changed at any time. A manual +unbind/bind is required to make the change effective for NCM/MBIM +functions bound to the "wrong" driver + + +Basic usage +=========== + +MBIM functions are inactive when unmanaged. The cdc_mbim driver only +provides an userspace interface to the MBIM control channel, and will +not participate in the management of the function. This implies that a +userspace MBIM management application always is required to enable a +MBIM function. + +Such userspace applications includes, but are not limited to: + - mbimcli (included with the libmbim [3] library), and + - ModemManager [4] + +Establishing a MBIM IP session reequires at least these actions by the +management application: + - open the control channel + - configure network connection settings + - connect to network + - configure IP interface + +Management application development +---------------------------------- +The driver <-> userspace interfaces are described below. The MBIM +control channel protocol is described in [1]. + + +MBIM control channel userspace ABI +================================== + +/dev/cdc-wdmX character device +------------------------------ +The driver creates a two-way pipe to the MBIM function control channel +using the cdc-wdm driver as a subdriver. The userspace end of the +control channel pipe is a /dev/cdc-wdmX character device. + +The cdc_mbim driver does not process or police messages on the control +channel. The channel is fully delegated to the userspace management +application. It is therefore up to this application to ensure that it +complies with all the control channel requirements in [1]. + +The cdc-wdmX device is created as a child of the MBIM control +interface USB device. The character device associated with a specific +MBIM function can be looked up using sysfs. For example: + + bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc + cdc-wdm0 + + bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev + 180:0 + + +USB configuration descriptors +----------------------------- +The wMaxControlMessage field of the CDC MBIM functional descriptor +limits the maximum control message size. The managament application is +responsible for negotiating a control message size complying with the +requirements in section 9.3.1 of [1], taking this descriptor field +into consideration. + +The userspace application can access the CDC MBIM functional +descriptor of a MBIM function using either of the two USB +configuration descriptor kernel interfaces described in [6] or [7]. + +See also the ioctl documentation below. + + +Fragmentation +------------- +The userspace application is responsible for all control message +fragmentation and defragmentaion, as described in section 9.5 of [1]. + + +/dev/cdc-wdmX write() +--------------------- +The MBIM control messages from the management application *must not* +exceed the negotiated control message size. + + +/dev/cdc-wdmX read() +-------------------- +The management application *must* accept control messages of up the +negotiated control message size. + + +/dev/cdc-wdmX ioctl() +-------------------- +IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size +This ioctl returns the wMaxControlMessage field of the CDC MBIM +functional descriptor for MBIM devices. This is intended as a +convenience, eliminating the need to parse the USB descriptors from +userspace. + + #include <stdio.h> + #include <fcntl.h> + #include <sys/ioctl.h> + #include <linux/types.h> + #include <linux/usb/cdc-wdm.h> + int main() + { + __u16 max; + int fd = open("/dev/cdc-wdm0", O_RDWR); + if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) + printf("wMaxControlMessage is %d\n", max); + } + + +Custom device services +---------------------- +The MBIM specification allows vendors to freely define additional +services. This is fully supported by the cdc_mbim driver. + +Support for new MBIM services, including vendor specified services, is +implemented entirely in userspace, like the rest of the MBIM control +protocol + +New services should be registered in the MBIM Registry [5]. + + + +MBIM data channel userspace ABI +=============================== + +wwanY network device +-------------------- +The cdc_mbim driver represents the MBIM data channel as a single +network device of the "wwan" type. This network device is initially +mapped to MBIM IP session 0. + + +Multiplexed IP sessions (IPS) +----------------------------- +MBIM allows multiplexing up to 256 IP sessions over a single USB data +channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN +subdevices of the master wwanY device, mapping MBIM IP session Z to +VLAN ID Z for all values of Z greater than 0. + +The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure +described in section 10.5.1 of [1]. + +The userspace management application is responsible for adding new +VLAN links prior to establishing MBIM IP sessions where the SessionId +is greater than 0. These links can be added by using the normal VLAN +kernel interfaces, either ioctl or netlink. + +For example, adding a link for a MBIM IP session with SessionId 3: + + ip link add link wwan0 name wwan0.3 type vlan id 3 + +The driver will automatically map the "wwan0.3" network device to MBIM +IP session 3. + + +Device Service Streams (DSS) +---------------------------- +MBIM also allows up to 256 non-IP data streams to be multiplexed over +the same shared USB data channel. The cdc_mbim driver models these +sessions as another set of 802.1q VLAN subdevices of the master wwanY +device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values +of A. + +The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO +structure described in section 10.5.29 of [1]. + +The DSS VLAN subdevices are used as a practical interface between the +shared MBIM data channel and a MBIM DSS aware userspace application. +It is not intended to be presented as-is to an end user. The +assumption is that an userspace application initiating a DSS session +also takes care of the necessary framing of the DSS data, presenting +the stream to the end user in an appropriate way for the stream type. + +The network device ABI requires a dummy ethernet header for every DSS +data frame being transported. The contents of this header is +arbitrary, with the following exceptions: + - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped + - RX frames will have the protocol field set to ETH_P_802_3 (but will + not be properly formatted 802.3 frames) + - RX frames will have the destination address set to the hardware + address of the master device + +The DSS supporting userspace management application is responsible for +adding the dummy ethernet header on TX and stripping it on RX. + +This is a simple example using tools commonly available, exporting +DssSessionId 5 as a pty character device pointed to by a /dev/nmea +symlink: + + ip link add link wwan0 name wwan0.dss5 type vlan id 261 + ip link set dev wwan0.dss5 up + socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea + +This is only an example, most suitable for testing out a DSS +service. Userspace applications supporting specific MBIM DSS services +are expected to use the tools and programming interfaces required by +that service. + +Note that adding VLAN links for DSS sessions is entirely optional. A +management application may instead choose to bind a packet socket +directly to the master network device, using the received VLAN tags to +map frames to the correct DSS session and adding 18 byte VLAN ethernet +headers with the appropriate tag on TX. In this case using a socket +filter is recommended, matching only the DSS VLAN subset. This avoid +unnecessary copying of unrelated IP session data to userspace. For +example: + + static struct sock_filter dssfilter[] = { + /* use special negative offsets to get VLAN tag */ + BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ + + /* verify DSS VLAN range */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ + + /* verify ethertype */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), + + BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ + BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ + }; + + + +Tagged IP session 0 VLAN +------------------------ +As described above, MBIM IP session 0 is treated as special by the +driver. It is initially mapped to untagged frames on the wwanY +network device. + +This mapping implies a few restrictions on multiplexed IPS and DSS +sessions, which may not always be practical: + - no IPS or DSS session can use a frame size greater than the MTU on + IP session 0 + - no IPS or DSS session can be in the up state unless the network + device representing IP session 0 also is up + +These problems can be avoided by optionally making the driver map IP +session 0 to a VLAN subdevice, similar to all other IP sessions. This +behaviour is triggered by adding a VLAN link for the magic VLAN ID +4094. The driver will then immediately start mapping MBIM IP session +0 to this VLAN, and will drop untagged frames on the master wwanY +device. + +Tip: It might be less confusing to the end user to name this VLAN +subdevice after the MBIM SessionID instead of the VLAN ID. For +example: + + ip link add link wwan0 name wwan0.0 type vlan id 4094 + + +VLAN mapping +------------ + +Summarizing the cdc_mbim driver mapping described above, we have this +relationship between VLAN tags on the wwanY network device and MBIM +sessions on the shared USB data channel: + + VLAN ID MBIM type MBIM SessionID Notes + --------------------------------------------------------- + untagged IPS 0 a) + 1 - 255 IPS 1 - 255 <VLANID> + 256 - 511 DSS 0 - 255 <VLANID - 256> + 512 - 4093 b) + 4094 IPS 0 c) + + a) if no VLAN ID 4094 link exists, else dropped + b) unsupported VLAN range, unconditionally dropped + c) if a VLAN ID 4094 link exists, else dropped + + + + +References +========== + +[1] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specification for Mobile Broadband + Interface Model", Revision 1.0 (Errata 1), May 1, 2013 + - http://www.usb.org/developers/docs/devclass_docs/ + +[2] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specifications for Network Control + Model Devices", Revision 1.0 (Errata 1), November 24, 2010 + - http://www.usb.org/developers/docs/devclass_docs/ + +[3] libmbim - "a glib-based library for talking to WWAN modems and + devices which speak the Mobile Interface Broadband Model (MBIM) + protocol" + - http://www.freedesktop.org/wiki/Software/libmbim/ + +[4] ModemManager - "a DBus-activated daemon which controls mobile + broadband (2G/3G/4G) devices and connections" + - http://www.freedesktop.org/wiki/Software/ModemManager/ + +[5] "MBIM (Mobile Broadband Interface Model) Registry" + - http://compliance.usb.org/mbim/ + +[6] "/proc/bus/usb filesystem output" + - Documentation/usb/proc_usb_info.txt + +[7] "/sys/bus/usb/devices/.../descriptors" + - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index e3ba753cb714..ee78eba78a9d 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -281,6 +281,7 @@ Possible BPF extensions are shown in the following table: cpu raw_smp_processor_id() vlan_tci vlan_tx_tag_get(skb) vlan_pr vlan_tx_tag_present(skb) + rand prandom_u32() These extensions can also be prefixed with '#'. Examples for low-level BPF: @@ -308,6 +309,18 @@ Examples for low-level BPF: ret #-1 drop: ret #0 +** icmp random packet sampling, 1 in 4 + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + ** SECCOMP filter example: ld [4] /* offsetof(struct seccomp_data, arch) */ @@ -548,42 +561,43 @@ toolchain for developing and testing the kernel's JIT compiler. BPF kernel internals -------------------- -Internally, for the kernel interpreter, a different BPF instruction set +Internally, for the kernel interpreter, a different instruction set format with similar underlying principles from BPF described in previous paragraphs is being used. However, the instruction set format is modelled closer to the underlying architecture to mimic native instruction sets, so -that a better performance can be achieved (more details later). +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) It is designed to be JITed with one to one mapping, which can also open up -the possibility for GCC/LLVM compilers to generate optimized BPF code through -a BPF backend that performs almost as fast as natively compiled code. +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. The new instruction set was originally designed with the possible goal in -mind to write programs in "restricted C" and compile into BPF with a optional +mind to write programs in "restricted C" and compile into eBPF with a optional GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with -minimal performance overhead over two steps, that is, C -> BPF -> native code. +minimal performance overhead over two steps, that is, C -> eBPF -> native code. Currently, the new format is being used for running user BPF programs, which includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, team driver's classifier for its load-balancing mode, netfilter's xt_bpf extension, PTP dissector/classifier, and much more. They are all internally converted by the kernel into the new instruction set representation and run -in the extended interpreter. For in-kernel handlers, this all works -transparently by using sk_unattached_filter_create() for setting up the -filter, resp. sk_unattached_filter_destroy() for destroying it. The macro -SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to -run the filter. 'filter' is a pointer to struct sk_filter that we got from -sk_unattached_filter_create(), and 'ctx' the given context (e.g. skb pointer). -All constraints and restrictions from sk_chk_filter() apply before a -conversion to the new layout is being done behind the scenes! - -Currently, for JITing, the user BPF format is being used and current BPF JIT -compilers reused whenever possible. In other words, we do not (yet!) perform -a JIT compilation in the new layout, however, future work will successively -migrate traditional JIT compilers into the new instruction format as well, so -that they will profit from the very same benefits. Thus, when speaking about -JIT in the following, a JIT compiler (TBD) for the new instruction format is -meant in this context. +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using sk_unattached_filter_create() for setting up the filter, resp. +sk_unattached_filter_destroy() for destroying it. The macro +SK_RUN_FILTER(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct sk_filter that we +got from sk_unattached_filter_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from sk_chk_filter() apply +before a conversion to the new layout is being done behind the scenes! + +Currently, the classic BPF format is being used for JITing on most of the +architectures. Only x86-64 performs JIT compilation from eBPF instruction set, +however, future work will migrate other JIT compilers as well, so that they +will profit from the very same benefits. Some core changes of the new internal format: @@ -592,35 +606,35 @@ Some core changes of the new internal format: The old format had two registers A and X, and a hidden frame pointer. The new layout extends this to be 10 internal registers and a read-only frame pointer. Since 64-bit CPUs are passing arguments to functions via registers - the number of args from BPF program to in-kernel function is restricted + the number of args from eBPF program to in-kernel function is restricted to 5 and one register is used to accept return value from an in-kernel function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. - Therefore, BPF calling convention is defined as: + Therefore, eBPF calling convention is defined as: - * R0 - return value from in-kernel function - * R1 - R5 - arguments from BPF program to in-kernel function + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function * R6 - R9 - callee saved registers that in-kernel function will preserve * R10 - read-only frame pointer to access stack - Thus, all BPF registers map one to one to HW registers on x86_64, aarch64, - etc, and BPF calling convention maps directly to ABIs used by the kernel on + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on 64-bit architectures. On 32-bit architectures JIT may map programs that use only 32-bit arithmetic and may let more complex programs to be interpreted. - R0 - R5 are scratch registers and BPF program needs spill/fill them if - necessary across calls. Note that there is only one BPF program (== one BPF - main routine) and it cannot call other BPF functions, it can only call - predefined in-kernel functions, though. + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. - Register width increases from 32-bit to 64-bit: Still, the semantics of the original 32-bit ALU operations are preserved - via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower subregisters that zero-extend into 64-bit if they are being written to. That behavior maps directly to x86_64 and arm64 subregister definition, but makes other JITs more difficult. @@ -631,8 +645,8 @@ Some core changes of the new internal format: Operation is 64-bit, because on 64-bit architectures, pointers are also 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, - so 32-bit BPF registers would otherwise require to define register-pair - ABI, thus, there won't be able to use a direct BPF register to HW register + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register mapping and JIT would need to do combine/split/move operations for every register in and out of the function, which is complex, bug prone and slow. Another reason is the use of atomic 64-bit counters. @@ -646,14 +660,145 @@ Some core changes of the new internal format: - Introduces bpf_call insn and register passing convention for zero overhead calls from/to other kernel functions: - After a kernel function call, R1 - R5 are reset to unreadable and R0 has a - return type of the function. Since R6 - R9 are callee saved, their state is - preserved across the call. - -Also in the new design, BPF is limited to 4096 insns, which means that any + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __sk_run_filter() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __sk_run_filter() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all register map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. + + Then the following internal BPF pseudo-program: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into '%rax' which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + In the future an eBPF verifier can be used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any program will terminate quickly and will only call a fixed number of kernel functions. Original BPF and the new format are two operand instructions, -which helps to do one-to-one mapping between BPF insn and x86 insn during JIT. +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. The input context pointer for invoking the interpreter function is generic, its content is defined by a specific use case. For seccomp register R1 points @@ -661,7 +806,26 @@ to seccomp_data, for converted BPF filters R1 points to a skb. A program, that is translated internally consists of the following elements: - op:16, jt:8, jf:8, k:32 ==> op:8, a_reg:4, x_reg:4, off:16, imm:32 + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using 'exclusive add' instruction, but +tracing filters may do to maintain counters of events, for example. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can used as generic assembler for last step performance +optimizations, socket filters and seccomp are using it as assembler. Tracing +filters may use it as assembler to generate code from kernel. In kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as safe instruction set. Just like the original BPF, the new format runs within a controlled environment, is deterministic and the kernel can easily prove that. The safety of the program @@ -670,6 +834,181 @@ loops and other CFG validation; second step starts from the first insn and descends all possible paths. It simulates execution of every insn and observes the state change of registers and stack. +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + Classic BPF classes: eBPF classes: + + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 [ class 6 unused, for future if needed ] + BPF_MISC 0x07 BPF_ALU64 0x07 + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. + +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 + BPF_DIV 0x30 + BPF_OR 0x40 + BPF_AND 0x50 + BPF_LSH 0x60 + BPF_RSH 0x70 + BPF_NEG 0x80 + BPF_MOD 0x90 + BPF_XOR 0xa0 + BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ + BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ + BPF_END 0xd0 /* eBPF only: endianness conversion */ + +If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: + + BPF_JA 0x00 + BPF_JEQ 0x10 + BPF_JGT 0x20 + BPF_JGE 0x30 + BPF_JSET 0x40 + BPF_JNE 0x50 /* eBPF only: jump != */ + BPF_JSGT 0x60 /* eBPF only: signed '>' */ + BPF_JSGE 0x70 /* eBPF only: signed '>=' */ + BPF_CALL 0x80 /* eBPF only: function call */ + BPF_EXIT 0x90 /* eBPF only: function return */ + +So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF +and eBPF. There are only two registers in classic BPF, so it means A += X. +In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, +BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous +src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. + +Classic BPF is using BPF_MISC class to represent A = X and X = A moves. +eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no +BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean +exactly the same operations as BPF_ALU, but with 64-bit wide operands +instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: +dst_reg = dst_reg + src_reg + +Classic BPF wastes the whole BPF_RET class to represent a single 'ret' +operation. Classic BPF_RET | BPF_K means copy imm32 into return register +and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT +in eBPF means function exit only. The eBPF program needs to store return +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently +unused and reserved for future use. + +For load and store instructions the 8-bit 'code' field is divided as: + + +--------+--------+-------------------+ + | 3 bits | 2 bits | 3 bits | + | mode | size | instruction class | + +--------+--------+-------------------+ + (MSB) (LSB) + +Size modifier is one of ... + + BPF_W 0x00 /* word */ + BPF_H 0x08 /* half word */ + BPF_B 0x10 /* byte */ + BPF_DW 0x18 /* eBPF only, double word */ + +... which encodes size of load/store operation: + + B - 1 byte + H - 2 byte + W - 4 byte + DW - 8 byte (eBPF only) + +Mode modifier is one of: + + BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ + BPF_ABS 0x20 + BPF_IND 0x40 + BPF_MEM 0x60 + BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ + BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ + BPF_XADD 0xc0 /* eBPF only, exclusive add */ + +eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and +(BPF_IND | <size> | BPF_LD) which are used to access packet data. + +They had to be carried over from classic to have strong performance of +socket filters running in eBPF interpreter. These instructions can only +be used when interpreter context is a pointer to 'struct sk_buff' and +have seven implicit operands. Register R6 is an implicit input that must +contain pointer to sk_buff. Register R0 is an implicit output which contains +the data fetched from the packet. Registers R1-R5 are scratch registers +and must not be used to store the data across BPF_ABS | BPF_LD or +BPF_IND | BPF_LD instructions. + +These instructions have implicit program exit condition as well. When +eBPF program is trying to access the data beyond the packet boundary, +the interpreter will abort the execution of the program. JIT compilers +therefore must preserve this property. src_reg and imm32 fields are +explicit inputs to these instructions. + +For example: + + BPF_IND | BPF_W | BPF_LD means: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) + and R1 - R5 were scratched. + +Unlike classic BPF instruction set, eBPF has generic load/store operations: + +BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg +BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 +BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) +BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg +BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg + +Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and +2 byte atomic increments are not supported. + +Testing +------- + +Next to the BPF toolchain, the kernel also ships a test module that contains +various test cases for classic and internal BPF that can be executed against +the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and +enabled via Kconfig: + + CONFIG_TEST_BPF=m + +After the module has been built and installed, the test suite can be executed +via insmod or modprobe against 'test_bpf' module. Results of the test cases +including timings in nsec can be found in the kernel log (dmesg). + Misc ---- |