Commits · 4fb74506838b3e34eaaebfcf90ebcd1fd52ab813 · Kirill Smelkov / linux

04 Nov, 2016 16 commits

David S. Miller authored Nov 04, 2016

Lorenzo Colitti says:

====================
net: inet: Support UID-based routing

This patchset adds support for per-UID routing. It allows the
administrator to configure rules such as:

  ip rule add uidrange 100-200 lookup 123

This functionality has been in use by all Android devices since
5.0. It is primarily used to impose per-app routing policies (on
Android, every app has its own UID) without having to resort to
rerouting packets in iptables, which breaks getsockname() and
MTU/MSS calculation, and generally disrupts end-to-end
connectivity.

This patch series is similar to the code currently used on
Android, but has better correctness and performance because
it stores the UID in the socket instead of calling sock_i_uid.
This avoids contention on sk->sk_callback_lock, and makes it
possible to correctly route a socket on which userspace has
called close(), for which sock_i_uid will return 0.

Changes from v1:
- Don't set the UID in sk_clone_lock, it's already set by
  sock_copy.
- For packets originated by kernel sockets, don't use the socket
  UID. This is the UID that created the namespace, but it might
  not be mapped in the namespace at all. Instead, use UID 0 in
  the namespace, which is less surprising and consistent with
  what happens in the root namespace.
- Fix UID routing of IPv4 and IPv6 SYN_RECV sockets.
- Fix UID routing of received IPv6 redirects.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

4fb74506

net: inet: Support UID-based routing in IP protocols. · e2d118a1

Lorenzo Colitti authored Nov 04, 2016

- Use the UID in routing lookups made by protocol connect() and
  sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
  (e.g., Path MTU discovery) take the UID of the socket into
  account.
- For packets not associated with a userspace socket (e.g., ping
  replies), use UID 0 inside the user namespace corresponding to
  the network namespace the socket belongs to. This allows
  all namespaces to apply routing and iptables rules to
  kernel-originated traffic in that namespace by matching UID 0.
  This is better than using the UID of the kernel socket that is
  sending the traffic, because the UID of kernel sockets created
  at namespace creation time (e.g., the per-processor ICMP and
  TCP sockets) is the UID of the user that created the socket,
  which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2d118a1

net: core: add UID to flows, rules, and routes · 622ec2c9

Lorenzo Colitti authored Nov 04, 2016

- Define a new FIB rule attribute, FRA_UID_RANGE, to describe a
  range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
  rtnetlink. The value INVALID_UID indicates no UID was
  specified.
- Add a UID field to the flow structures.
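
As a rough illustration of what a UID range rule has to decide, the check can be sketched in plain C. This is a userspace sketch, not the kernel code: the struct and function names are hypothetical, and the inclusive bounds are an assumption taken from the `uidrange 100-200` syntax in the cover letter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a UID range carried by a rule. */
struct uid_range_sketch {
	uint32_t start;	/* first UID covered by the rule */
	uint32_t end;	/* last UID covered by the rule (assumed inclusive) */
};

/* Return true when the UID of a flow falls inside the rule's range. */
bool uid_range_matches(const struct uid_range_sketch *r, uint32_t uid)
{
	return uid >= r->start && uid <= r->end;
}
```

With a range of 100-200, UIDs 100 and 200 match while 99 and 201 do not.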
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

622ec2c9

net: core: Add a UID field to struct sock. · 86741ec2

Lorenzo Colitti authored Nov 04, 2016

Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.

Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().

Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:

1. Whenever sk_socket is non-null: sk_uid is the same as the UID
   in sk_socket, i.e., matches the return value of sock_i_uid.
   Specifically, the UID is set when userspace calls socket(),
   fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
   - For a socket that no longer has a sk_socket because
     userspace has called close(): the previous UID.
   - For a cloned socket (e.g., an incoming connection that is
     established but on which userspace has not yet called
     accept): the UID of the socket it was cloned from.
   - For a socket that has never had an sk_socket: UID 0 inside
     the user namespace corresponding to the network namespace
     the socket belongs to.

Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

86741ec2

Merge branch 'dsa-mv88e6xxx-port-operation-refine' · 0d53072a

David S. Miller authored Nov 04, 2016

Vivien Didelot says:

====================
net: dsa: mv88e6xxx: refine port operations

The Marvell chips have one internal SMI device per port, containing a
set of registers used to configure a port's link, STP state, default
VLAN or addresses database, etc.

This patchset creates port files to implement the port operations as
described in datasheets, and extends the chip ops structure with them.

Patches 1 to 6 implement accessors for port's STP state, port based VLAN
map, default FID, default VID, and 802.1Q mode.

Patches 7 to 11 implement the port's MAC setup of link state, duplex
mode, RGMII delay and speed, all accessed through port's register 0x01.

The new port's MAC setup code is used to re-implement the adjust_link
code and correctly force the link down before changing any of the MAC
settings, as requested by the datasheets.

The port's MAC accessors use values compatible with struct phy_device
(e.g. DUPLEX_FULL) and extend them when needed (e.g. SPEED_MAX).

Changes in v2:

  - Strictly use new _UNFORCED values instead of re-using _UNKNOWN ones.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

0d53072a

net: dsa: mv88e6xxx: setup port's MAC · d78343d2

Vivien Didelot authored Nov 04, 2016

Now that we have setters to configure the port's MAC, use them to
refactor the port setup and adjust_link code.

Note that port's MAC speed, duplex or RGMII delay must not be changed
unless the port's link is forced down. So wrap all that in a
mv88e6xxx_port_setup_mac function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d78343d2

net: dsa: mv88e6xxx: add port's MAC speed setter · 96a2b40c

Vivien Didelot authored Nov 04, 2016

While the two bits for link, duplex or RGMII delays are used the same
way on chips supporting the said feature, the two bits for speed have
different meanings for most of the chips out there.

Speed value is stored in bits 1:0, 0x3 means unforce (normal detection).

Some chips reuse values for alternative speeds when bit 12 is set.

Newer chips with speed > 1Gbps reuse value 0x3 thus need a new bit 13.

Here are the values to write in register 0x1 to (un)force speed:

    | Speed   | 88E6065 | 88E6185 | 88E6352 | 88E6390 | 88E6390X |
    | ------- | ------- | ------- | ------- | ------- | -------- |
    | 10      | 0x0000  | 0x0000  | 0x0000  | 0x2000  | 0x2000   |
    | 100     | 0x0001  | 0x0001  | 0x0001  | 0x2001  | 0x2001   |
    | 200     | 0x0002  | NA      | 0x1001  | 0x3001  | 0x3001   |
    | 1000    | NA      | 0x0002  | 0x0002  | 0x2002  | 0x2002   |
    | 2500    | NA      | NA      | NA      | 0x3003  | 0x3003   |
    | 10000   | NA      | NA      | NA      | NA      | 0x2003   |
    | unforce | 0x0003  | 0x0003  | 0x0003  | 0x0000  | 0x0000   |

This patch implements a generic mv88e6xxx_port_set_speed() function used
by chip-specific wrappers to filter supported ports and speeds.
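
The table above can be read as a plain lookup. The following is a minimal sketch of that lookup only, not the driver's implementation: the enum, the `SPEED_UNFORCED_SKETCH` and `REG_NA` markers, and the function name are all hypothetical, and the register values are copied from the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chip identifiers, one per table column. */
enum chip_sketch { CHIP_6065, CHIP_6185, CHIP_6352, CHIP_6390, CHIP_6390X, CHIP_COUNT };

#define SPEED_UNFORCED_SKETCH (-1)	/* hypothetical marker for the "unforce" row */
#define REG_NA 0xffff			/* hypothetical marker for "NA" cells */

struct speed_row {
	int speed;			/* Mbps, or SPEED_UNFORCED_SKETCH */
	uint16_t reg[CHIP_COUNT];	/* value for register 0x1, per chip */
};

static const struct speed_row speed_table[] = {
	{ 10,    { 0x0000, 0x0000, 0x0000, 0x2000, 0x2000 } },
	{ 100,   { 0x0001, 0x0001, 0x0001, 0x2001, 0x2001 } },
	{ 200,   { 0x0002, REG_NA, 0x1001, 0x3001, 0x3001 } },
	{ 1000,  { REG_NA, 0x0002, 0x0002, 0x2002, 0x2002 } },
	{ 2500,  { REG_NA, REG_NA, REG_NA, 0x3003, 0x3003 } },
	{ 10000, { REG_NA, REG_NA, REG_NA, REG_NA, 0x2003 } },
	{ SPEED_UNFORCED_SKETCH, { 0x0003, 0x0003, 0x0003, 0x0000, 0x0000 } },
};

/* Store the table value in *val and return 0, or return -1 if unsupported. */
int speed_to_reg(enum chip_sketch chip, int speed, uint16_t *val)
{
	size_t i;

	if ((unsigned int)chip >= CHIP_COUNT)
		return -1;

	for (i = 0; i < sizeof(speed_table) / sizeof(speed_table[0]); i++) {
		if (speed_table[i].speed != speed)
			continue;
		if (speed_table[i].reg[chip] == REG_NA)
			return -1;
		*val = speed_table[i].reg[chip];
		return 0;
	}
	return -1;
}
```

For example, 200 Mbps on the 88E6352 column yields 0x1001 (the alternative-speed bit 12 plus the 100 Mbps encoding), while 1000 Mbps on the 88E6065 column is rejected.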
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

96a2b40c

net: dsa: mv88e6xxx: add port's RGMII delay setter · a0a0f622

Vivien Didelot authored Nov 04, 2016

Some chips such as 88E6352 and 88E6390 can be programmed to add delays
to RXCLK for IND inputs or to GTXCLK for OUTD outputs when port is in
RGMII mode.

Add a port function to program such delays according to the provided PHY
interface mode.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a0a0f622

net: dsa: mv88e6xxx: add port duplex setter · 7f1ae07b

Vivien Didelot authored Nov 04, 2016

Similarly to the port's link, add a setter to force the port's half duplex, full
duplex, or let normal duplex detection occur.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f1ae07b

net: dsa: mv88e6xxx: add port link setter · 08ef7f10

Vivien Didelot authored Nov 04, 2016

Most of the chips have port register control bits to force the
port's link up, down, or let normal link detection occur.

Implement this operation so it can be used later when setting duplex, etc.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

08ef7f10

net: dsa: mv88e6xxx: add port 802.1Q mode setter · 385a0995

Vivien Didelot authored Nov 04, 2016

Add port functions to set the port 802.1Q mode.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

385a0995

net: dsa: mv88e6xxx: add port PVID accessors · 77064f37

Vivien Didelot authored Nov 04, 2016

Add port functions to access the port's default VID.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

77064f37

net: dsa: mv88e6xxx: add port FID accessors · b4e48c50

Vivien Didelot authored Nov 04, 2016

Add functions to port files to access the port's default FID.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4e48c50

net: dsa: mv88e6xxx: add port vlan map setter · 5a7921f4

Vivien Didelot authored Nov 04, 2016

Add a port function to access the Port Based VLAN Map register.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a7921f4

net: dsa: mv88e6xxx: add port state setter · e28def33

Vivien Didelot authored Nov 04, 2016

Add the port STP state setter to the port files.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e28def33

net: dsa: mv88e6xxx: add port files · 18abed21

Vivien Didelot authored Nov 04, 2016

The Marvell switches contain one internal SMI device per port, called
"Port Registers". Depending on the model, the addresses of these devices
start from 0x0, 0x8 or 0x10.

Start moving Port Registers specific code to their own files.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18abed21

03 Nov, 2016 12 commits

net/sched: cls_flower: Support matching on SCTP ports · 5976c5f4

Simon Horman authored Nov 03, 2016

Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent ffff: \
        flower indev eth0 ip_proto sctp dst_port 80 \
        action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5976c5f4

mlxsw: pci: Fix the FW ready mask length · 5e5f89e7

Elad Raz authored Nov 03, 2016

The system-status register is actually 16 bits wide, not 8 bits wide.

Fixes: 233fa44b ("mlxsw: pci: Implement reset done check")
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e5f89e7

Merge branch 'ip-recvfragsize-cmsg' · a7991268

David S. Miller authored Nov 03, 2016

Willem de Bruijn says:

====================
ip: add RECVFRAGSIZE cmsg

On IP datagrams and raw sockets, when packets arrive fragmented,
expose the largest received fragment size through a new cmsg.

Protocols implemented on top of these sockets may use this, for
instance, to inform peers to lower MSS on platforms that silently
allow send calls to exceed PMTU and cause fragmentation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

a7991268

ipv6: on reassembly, record frag_max_size · dbd1759e

Willem de Bruijn authored Nov 02, 2016

IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
filled in when packets are reassembled by the connection tracking
code. Also fill in when reassembling in the input path, to expose
it through cmsg IPV6_RECVFRAGSIZE in all cases.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dbd1759e

ipv6: add IPV6_RECVFRAGSIZE cmsg · 0cc0aa61

Willem de Bruijn authored Nov 02, 2016

When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.

At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.

Tested using the test for IP_RECVFRAGSIZE plus

  ip netns exec to ip addr add dev veth1 fc07::1/64
  ip netns exec from ip addr add dev veth0 fc07::2/64

  ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
  ip netns exec from nc -q 1 -u fc07::1 6000 < payload

Both with and without enabling connection tracking

  ip6tables -A INPUT -m state --state NEW -p udp -j LOG
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0cc0aa61

ipv4: add IP_RECVFRAGSIZE cmsg · 70ecc248

Willem de Bruijn authored Nov 02, 2016

The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.
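
From the application side, reading the value could look roughly like the sketch below. The two helper names are hypothetical. The fallback definition of `IP_RECVFRAGSIZE` as 25 and the `int` payload at level `IPPROTO_IP` are assumptions based on the Linux UAPI headers and should be checked against the running kernel.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IP_RECVFRAGSIZE
#define IP_RECVFRAGSIZE 25	/* assumed value from linux/in.h, for older libc headers */
#endif

/* Ask the kernel to attach the cmsg to datagrams received on fd. */
int enable_recvfragsize(int fd)
{
	int on = 1;

	return setsockopt(fd, IPPROTO_IP, IP_RECVFRAGSIZE, &on, sizeof(on));
}

/*
 * Walk the control messages filled in by recvmsg() and return the
 * largest fragment size, or -1 if the cmsg is absent (for instance
 * because the datagram did not arrive fragmented).
 */
int find_recvfragsize(struct msghdr *msg)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == IPPROTO_IP &&
		    cm->cmsg_type == IP_RECVFRAGSIZE &&
		    cm->cmsg_len >= CMSG_LEN(sizeof(int))) {
			int size;

			memcpy(&size, CMSG_DATA(cm), sizeof(size));
			return size;
		}
	}
	return -1;
}
```

A caller would run `enable_recvfragsize()` once after creating the socket, then pass each `struct msghdr` returned by `recvmsg()` to `find_recvfragsize()`.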

Tested:
  Sent data over a veth pair of which the source has a small mtu.
  Sent data using netcat, received using a dedicated process.

  Verified that the cmsg IP_RECVFRAGSIZE is returned only when
  data arrives fragmented, and in that case matches the veth mtu.

    ip link add veth0 type veth peer name veth1

    ip netns add from
    ip netns add to

    ip link set dev veth1 netns to
    ip netns exec to ip addr add dev veth1 192.168.10.1/24
    ip netns exec to ip link set dev veth1 up

    ip link set dev veth0 netns from
    ip netns exec from ip addr add dev veth0 192.168.10.2/24
    ip netns exec from ip link set dev veth0 up
    ip netns exec from ip link set dev veth0 mtu 1300
    ip netns exec from ethtool -K veth0 ufo off

    dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload

    ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
    ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload

  using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

70ecc248

Merge branch 'stmmac-OXNAS' · 1b99e5e8

David S. Miller authored Nov 03, 2016

Neil Armstrong says:

====================
net: stmmac: Add OXNAS DWMAC Glue

This patchset adds support for the Synopsys DWMAC Gigabit Ethernet
controller Glue layer of the Oxford Semiconductor OX820 SoC.

Changes since v2 at http://lkml.kernel.org/r/20161031105345.16711-1-narmstrong@baylibre.com :
 - Disable/Unprepare clock if regmap read fails in oxnas_dwmac_init

Changes since v1 at https://patchwork.kernel.org/patch/9388231/ :
 - Split dt-bindings in a separate patch
 - Add IP version in the dt-bindings compatible
 - Check return of clk_prepare_enable()
 - use get_stmmac_bsp_priv() helper
 - hardwire setup values in oxnas_dwmac_init()

Changes since RFC at https://patchwork.kernel.org/patch/9387257 :
 - Drop init/exit callbacks
 - Implement proper remove and PM callback
 - Call init from probe
 - Disable/Unprepare clock if stmmac probe fails
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

1b99e5e8

dt-bindings: net: Add OXNAS DWMAC Bindings · 52b6c5c2

Neil Armstrong authored Nov 02, 2016

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

52b6c5c2

net: stmmac: Add OXNAS Glue Driver · 5ed74140

Neil Armstrong authored Nov 02, 2016

Add Synopsys Designware MAC Glue layer for the Oxford Semiconductor OX820.
Acked-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5ed74140

Merge branch 'diag-raw-fixes' · d3bc29a4

David S. Miller authored Nov 03, 2016

Cyrill Gorcunov says:

====================
net: Fixes for raw diag sockets handling

Hi! Here are a few fixes for raw-diag sockets handling: missing
sock_put call and jump for exiting from nested cycle. I made
patches for iproute2 as well so will send them out soon.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d3bc29a4

net: ip, raw_diag -- Use jump for exiting from nested loop · 9999370f

Cyrill Gorcunov authored Nov 02, 2016

I managed to miss that sk_for_each is called inside a "for"
loop, so we need to use goto here to return the matching socket.
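
The general pattern can be sketched in standalone C. This is only an illustration of leaving two nested loops with goto; the table layout and the function name are hypothetical and do not mirror the raw_diag code.

```c
#include <stddef.h>

#define NBUCKETS   4
#define BUCKET_LEN 3

/* Hypothetical stand-in for a bucketed table that is searched for one entry. */
const int *find_entry(const int table[NBUCKETS][BUCKET_LEN], int wanted)
{
	const int *found = NULL;
	size_t b, i;

	for (b = 0; b < NBUCKETS; b++) {
		for (i = 0; i < BUCKET_LEN; i++) {
			if (table[b][i] == wanted) {
				found = &table[b][i];
				/*
				 * A plain break would leave only the inner
				 * loop; the outer loop would keep scanning
				 * the remaining buckets.
				 */
				goto out;
			}
		}
	}
out:
	return found;
}
```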

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9999370f

net: ip, raw_diag -- Fix socket leaking for destroy request · cd05a0ec

Cyrill Gorcunov authored Nov 02, 2016

In raw_diag_destroy the helper raw_sock_get returns the socket
with a reference taken by sock_hold, so we have to release it with sock_put.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd05a0ec

02 Nov, 2016 12 commits

enic: set skb->hash type properly · 17197236

Govindarajulu Varadarajan authored Nov 01, 2016

The driver sets the skb l4/l3 hash based on NIC_CFG_RSS_HASH_TYPE_*,
which is a bit mask. This is wrong: the hardware actually provides an enum.
Use CQ_ENET_RQ_DESC_RSS_TYPE_* to set the l3 and l4 hash type.
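
Why testing an enumerated value against a bit mask goes wrong can be shown with a small standalone sketch. All names and numeric values below are hypothetical and chosen for illustration; they are not the enic definitions.

```c
#include <stdbool.h>

/* Hypothetical enumerated hash types, as reported per packet. */
enum rss_type_sketch {
	RSS_NONE     = 0,
	RSS_IPV4     = 1,
	RSS_TCP_IPV4 = 2,
	RSS_IPV6     = 3,
	RSS_TCP_IPV6 = 4,
};

/* Hypothetical configuration bit mask, meant for enabling hash types. */
#define CFG_HASH_TCP_IPV4 (1u << 2)	/* == 4 */

/* Buggy: interprets the enumerated value as if it were the bit mask. */
bool is_l4_hash_buggy(unsigned int type)
{
	return (type & CFG_HASH_TCP_IPV4) != 0;
}

/* Correct: compares against the enumerated values. */
bool is_l4_hash(unsigned int type)
{
	switch (type) {
	case RSS_TCP_IPV4:
	case RSS_TCP_IPV6:
		return true;
	default:
		return false;
	}
}
```

With these made-up values, the buggy check misses `RSS_TCP_IPV4` (2 & 4 is 0), while the switch classifies it correctly.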

Fixes: bf751ba8 ("driver/net: enic: record q_number and rss_hash for skb")
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17197236

net: 3com: typhoon: use new api ethtool_{get|set}_link_ksettings · f7a5537c

Philippe Reynes authored Nov 02, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Reviewed-by: David Dillow <dave@thedillows.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7a5537c

ila: Fix crash caused by rhashtable changes · 1913540a

Tom Herbert authored Nov 01, 2016

commit ca26893f ("rhashtable: Add rhlist interface")
added a field to rhashtable_iter so that its length became 56 bytes
and would exceed the size of args in netlink_callback (which is
48 bytes). The netlink diag dump function has already been
allocating an iter structure and storing the pointer to that
in the args of netlink_callback. ila_xlat also uses
rhashtable_iter but is still putting that directly in
the arg block. Now that the rhashtable_iter size has increased,
we are overwriting beyond the structure. The next field
happens to be the cb_mutex pointer in netlink_sock, hence the crash.

The fix is to allocate the rhashtable_iter and save it as a pointer
in arg.
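
The shape of that fix can be sketched in userspace C. The struct and function names are hypothetical stand-ins; only the sizes (a 56-byte iterator and a six-slot args array, 48 bytes on 64-bit) follow the description above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the callback scratch area. */
struct dump_cb_sketch {
	long args[6];
};

/* Hypothetical stand-in for an iterator that outgrew the scratch area. */
struct big_iter_sketch {
	char state[56];
};

/* The iterator no longer fits inside args, so only a pointer is stored there. */
_Static_assert(sizeof(struct big_iter_sketch) >
	       sizeof(((struct dump_cb_sketch *)0)->args),
	       "iterator would overflow args if stored in place");
_Static_assert(sizeof(long) >= sizeof(void *),
	       "this sketch assumes a pointer fits in one args slot");

/* Allocate the iterator and remember it in args[0]. */
int dump_start(struct dump_cb_sketch *cb)
{
	struct big_iter_sketch *iter = calloc(1, sizeof(*iter));

	if (!iter)
		return -1;
	cb->args[0] = (long)iter;
	return 0;
}

/* Fetch the iterator back from args[0]. */
struct big_iter_sketch *dump_iter(struct dump_cb_sketch *cb)
{
	return (struct big_iter_sketch *)cb->args[0];
}

/* Free the iterator when the dump finishes. */
void dump_done(struct dump_cb_sketch *cb)
{
	free(dump_iter(cb));
	cb->args[0] = 0;
}
```

The static assertions document the size relationship at build time, so a similar overflow would fail the build rather than corrupt a neighbouring field.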

Tested:

  modprobe ila
  ./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
  ./ip ila list  # NO crash now
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1913540a

net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect · 3de864f8

Cyrill Gorcunov authored Nov 01, 2016

While preparing patches for killing raw sockets via the
diag netlink interface I noticed that my runs were stuck:

 | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
 | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
 | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
 | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
 | [<ffffffff8179b517>] raw_abort+0x33/0x42
 | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52

which has not been the case before. I narrowed it down to the commit

 | commit 286c72de
 | Author: Eric Dumazet <edumazet@google.com>
 | Date:   Thu Oct 20 09:39:40 2016 -0700
 |
 |     udp: must lock the socket in udp_disconnect()

where we started locking the socket for a different reason.

So raw_abort escaped the renaming and we have to
fix this typo by using __udp_disconnect instead.

Fixes: 286c72de ("udp: must lock the socket in udp_disconnect()")
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3de864f8

lan78xx: Use irq_domain for phy interrupt from USB Int. EP · cc89c323

Woojung Huh authored Nov 01, 2016

To make full use of phylib's interrupt support rather than handling some of the PHY work in the MAC driver,
create an irq_domain for the PHY interrupt delivered through the USB interrupt EP and
pass the irq number to phy_connect_direct() instead of PHY_IGNORE_INTERRUPT.

Idea comes from drivers/gpio/gpio-dl2.c
Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc89c323

tcp: enhance tcp collapsing · 2331ccc5

Eric Dumazet authored Nov 01, 2016

As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
time even if the skb at the right has fragments.

We simply have to use the more generic skb_copy_bits() instead of
skb_copy_from_linear_data() in tcp_collapse_retrans().

Also need to guard this skb_copy_bits() in case there is nothing to
copy, otherwise skb_put() could panic if left skb has frags.

Tested:

Used following packetdrill test

// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.100 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 write(4, ..., 200) = 200
   +0 > P. 1:201(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 201:401(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 401:601(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 601:801(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 801:1001(200) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1001:1101(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1101:1201(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1201:1301(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1301:1401(100) ack 1

+.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
// Check that TCP collapse works :
   +0 > P. 1:1001(1000) ack 1
Reported-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2331ccc5

net: 3c509: use new api ethtool_{get|set}_link_ksettings · b646cf29

Philippe Reynes authored Nov 01, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b646cf29

net: 3c59x: use new api ethtool_{get|set}_link_ksettings · e19b7883

Philippe Reynes authored Nov 01, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e19b7883

net: mii: add generic function to support ksetting support · bc8ee596

Philippe Reynes authored Nov 01, 2016

The old ethtool api (get_settings and set_settings) has generic mii
functions mii_ethtool_sset and mii_ethtool_gset.

To support the new ethtool api ({get|set}_link_ksettings), we add
two generic mii functions, mii_ethtool_{get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc8ee596

Merge branch 'mlx4-XDP-tx-refactor' · 55454e9e

David S. Miller authored Nov 02, 2016

Tariq Toukan says:

====================
mlx4 XDP TX refactor

This patchset refactors the XDP forwarding case, so that
its dedicated transmit queues are managed completely
separately from the regular ones.

It also adds ethtool counters for XDP cases.

Series generated against net-next commit:
22ca904a genetlink: fix error return code in genl_register_family()

Thanks,
Tariq.

v3:
* Exposed per ring counters.

v2:
* Added ethtool counters.
* Rebased, now patch 2 reverts Brenden's fix, as the bug no longer exists:
  958b3d39 ("net/mlx4_en: fixup xdp tx irq to match rx")
* Updated commit message of patch 2.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

55454e9e

net/mlx4_en: Add ethtool statistics for XDP cases · 15fca2c8

Tariq Toukan authored Nov 02, 2016

XDP statistics are reported in ethtool, in total and per ring,
as follows:
- xdp_drop: the number of packets dropped by xdp.
- xdp_tx: the number of packets forwarded by xdp.
- xdp_tx_full: the number of times an xdp forward failed
  due to a full tx xdp ring.

In addition, all packets that are dropped/forwarded by XDP
are no longer accounted in rx_packets/rx_bytes of the ring,
so that they count traffic that is passed to the stack.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15fca2c8

net/mlx4_en: Refactor the XDP forwarding rings scheme · 67f8b1dc

Tariq Toukan authored Nov 02, 2016

Separately manage the two types of TX rings: regular ones, and XDP.
Upon an XDP set, do not borrow regular TX rings and convert them
into XDP ones, but allocate new ones, unless we hit the max number
of rings.
This means that on systems with fewer cores we will not consume
the current TX rings for XDP, as long as we stay within the TX ring limit.

XDP TX rings counters are not shown in ethtool statistics.
Instead, XDP counters will be added to the respective RX rings
in a downstream patch.

This has no performance implications.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67f8b1dc