1. 04 Nov, 2016 16 commits
    • Merge branch 'uid-routing' · 4fb74506
      David S. Miller authored
      Lorenzo Colitti says:
      
      ====================
      net: inet: Support UID-based routing
      
      This patchset adds support for per-UID routing. It allows the
      administrator to configure rules such as:
      
        ip rule add uidrange 100-200 lookup 123
      
      This functionality has been in use by all Android devices since
      5.0. It is primarily used to impose per-app routing policies (on
      Android, every app has its own UID) without having to resort to
      rerouting packets in iptables, which breaks getsockname() and
      MTU/MSS calculation, and generally disrupts end-to-end
      connectivity.
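
      As a rough illustration of what a rule such as the one above
      expresses, here is a minimal sketch of a UID-range rule lookup.
      It is not the kernel's data structure or code path: uid_rule,
      lookup_table and fallback_table are hypothetical names chosen
      for this example.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical, simplified stand-in for a policy rule with a UID range. */
      struct uid_rule {
          uint32_t uid_start;  /* first UID matched (inclusive) */
          uint32_t uid_end;    /* last UID matched (inclusive) */
          uint32_t table;      /* routing table to consult on a match */
      };

      /* Return the table of the first rule whose range contains uid,
       * or fallback_table when no rule matches. */
      static uint32_t lookup_table(const struct uid_rule *rules, size_t n,
                                   uint32_t uid, uint32_t fallback_table)
      {
          for (size_t i = 0; i < n; i++) {
              if (uid >= rules[i].uid_start && uid <= rules[i].uid_end)
                  return rules[i].table;
          }
          return fallback_table;
      }
      ```

      With a single rule { 100, 200, 123 }, UIDs 100 through 200 select
      table 123 and every other UID falls through to the fallback.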
      
      This patch series is similar to the code currently used on
      Android, but has better correctness and performance because
      it stores the UID in the socket instead of calling sock_i_uid.
      This avoids contention on sk->sk_callback_lock, and makes it
      possible to correctly route a socket on which userspace has
      called close(), for which sock_i_uid will return 0.
      
      Changes from v1:
      - Don't set the UID in sk_clone_lock, it's already set by
        sock_copy.
      - For packets originated by kernel sockets, don't use the socket
        UID. This is the UID that created the namespace, but it might
        not be mapped in the namespace at all. Instead, use UID 0 in
        the namespace, which is less surprising and consistent with
        what happens in the root namespace.
      - Fix UID routing of IPv4 and IPv6 SYN_RECV sockets.
      - Fix UID routing of received IPv6 redirects.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti authored
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket (e.g., ping
        replies), use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespace by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: add UID to flows, rules, and routes · 622ec2c9
      Lorenzo Colitti authored
      - Define a new FIB rule attribute, FRA_UID_RANGE, to describe a
        range of UIDs.
      - Define a RTA_UID attribute for per-UID route lookups and dumps.
      - Support passing these attributes to and from userspace via
        rtnetlink. The value INVALID_UID indicates no UID was
        specified.
      - Add a UID field to the flow structures.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: Add a UID field to struct sock. · 86741ec2
      Lorenzo Colitti authored
      Protocol sockets (struct sock) don't have UIDs, but most of the
      time, they map 1:1 to userspace sockets (struct socket) which do.
      
      Various operations such as the iptables xt_owner match need
      access to the "UID of a socket", and do so by following the
      backpointer to the struct socket. This involves taking
      sk_callback_lock and doesn't work when there is no socket
      because userspace has already called close().
      
      Simplify this by adding a sk_uid field to struct sock whose value
      matches the UID of the corresponding struct socket. The semantics
      are as follows:
      
      1. Whenever sk_socket is non-null: sk_uid is the same as the UID
         in sk_socket, i.e., matches the return value of sock_i_uid.
         Specifically, the UID is set when userspace calls socket(),
         fchown(), or accept().
      2. When sk_socket is NULL, sk_uid is defined as follows:
         - For a socket that no longer has a sk_socket because
           userspace has called close(): the previous UID.
         - For a cloned socket (e.g., an incoming connection that is
           established but on which userspace has not yet called
           accept): the UID of the socket it was cloned from.
         - For a socket that has never had an sk_socket: UID 0 inside
           the user namespace corresponding to the network namespace
           the socket belongs to.
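
      The semantics listed above can be modelled in a few lines. This
      is only a sketch under simplifying assumptions: model_sock,
      model_socket and the model_* helpers are hypothetical stand-ins
      for struct sock, struct socket and the kernel paths that set
      sk_uid, and a plain integer stands in for kuid_t.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      typedef uint32_t model_uid;                 /* stand-in for kuid_t */

      struct model_socket { model_uid owner; };   /* models struct socket */

      struct model_sock {                         /* models struct sock */
          struct model_socket *socket;            /* models sk_socket, may be NULL */
          model_uid uid;                          /* models sk_uid */
      };

      /* Case 1: socket(), accept() or fchown() make sk_uid follow the socket. */
      static void model_attach(struct model_sock *sk, struct model_socket *so)
      {
          sk->socket = so;
          sk->uid = so->owner;
      }

      /* Case 2, close(): the socket goes away but the previous UID is kept. */
      static void model_close(struct model_sock *sk)
      {
          sk->socket = NULL;
      }

      /* Case 2, clone: no socket yet, UID inherited from the parent. */
      static struct model_sock model_clone(const struct model_sock *parent)
      {
          struct model_sock child = { NULL, parent->uid };
          return child;
      }

      /* Case 2, never had a socket: ns_root_uid stands for UID 0 in the
       * user namespace that owns the network namespace. */
      static struct model_sock model_kernel_sock(model_uid ns_root_uid)
      {
          struct model_sock sk = { NULL, ns_root_uid };
          return sk;
      }
      ```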
      
      Kernel sockets created by sock_create_kern are a special case
      of #1 and sk_uid is the user that created them. For kernel
      sockets created at network namespace creation time, such as the
      per-processor ICMP and TCP sockets, this is the user that created
      the network namespace.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-mv88e6xxx-port-operation-refine' · 0d53072a
      David S. Miller authored
      Vivien Didelot says:
      
      ====================
      net: dsa: mv88e6xxx: refine port operations
      
      The Marvell chips have one internal SMI device per port, containing a
      set of registers used to configure a port's link, STP state, default
      VLAN or address database, etc.
      
      This patchset creates port files to implement the port operations as
      described in datasheets, and extends the chip ops structure with them.
      
      Patches 1 to 6 implement accessors for port's STP state, port based VLAN
      map, default FID, default VID, and 802.1Q mode.
      
      Patches 7 to 11 implement the port's MAC setup of link state, duplex
      mode, RGMII delay and speed, all accessed through port's register 0x01.
      
      The new port's MAC setup code is used to re-implement the adjust_link
      code and correctly force the link down before changing any of the MAC
      settings, as requested by the datasheets.
      
      The port's MAC accessors use values compatible with struct phy_device
      (e.g. DUPLEX_FULL) and extend them when needed (e.g. SPEED_MAX).
      
      Changes in v2:
      
        - Strictly use new _UNFORCED values instead of re-using _UNKNOWN ones.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: setup port's MAC · d78343d2
      Vivien Didelot authored
      Now that we have setters to configure the port's MAC, use them to
      refactor the port setup and adjust_link code.
      
      Note that port's MAC speed, duplex or RGMII delay must not be changed
      unless the port's link is forced down. So wrap all that in a
      mv88e6xxx_port_setup_mac function.
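
      The ordering constraint described above can be sketched as
      follows. This is a toy in-memory model, not the driver's code:
      port_mac and every function below are hypothetical, and the
      unsafe_writes counter exists only to make the rule checkable.

      ```c
      #include <stdbool.h>

      /* Hypothetical in-memory model of a port's MAC settings. */
      struct port_mac {
          bool link_forced_down;
          bool link_up;
          int speed;
          int duplex;
          int unsafe_writes;  /* settings changed while link was not forced down */
      };

      static void force_link_down(struct port_mac *p)
      {
          p->link_forced_down = true;
          p->link_up = false;
      }

      static void set_speed(struct port_mac *p, int speed)
      {
          if (!p->link_forced_down)
              p->unsafe_writes++;
          p->speed = speed;
      }

      static void set_duplex(struct port_mac *p, int duplex)
      {
          if (!p->link_forced_down)
              p->unsafe_writes++;
          p->duplex = duplex;
      }

      static void restore_link(struct port_mac *p, bool up)
      {
          p->link_forced_down = false;
          p->link_up = up;
      }

      /* Wrapper: force the link down, change the settings, then restore. */
      static void port_setup_mac(struct port_mac *p, bool up, int speed, int duplex)
      {
          force_link_down(p);
          set_speed(p, speed);
          set_duplex(p, duplex);
          restore_link(p, up);
      }
      ```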
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port's MAC speed setter · 96a2b40c
      Vivien Didelot authored
      While the two bits for link, duplex or RGMII delays are used the same
      way on chips supporting the said feature, the two bits for speed have
      different meanings for most of the chips out there.
      
      The speed value is stored in bits 1:0; 0x3 means unforce (normal detection).
      
      Some chips reuse values for alternative speeds when bit 12 is set.
      
      Newer chips with speed > 1Gbps reuse value 0x3 and thus need a new bit 13.
      
      Here are the values to write in register 0x1 to (un)force speed:
      
          | Speed   | 88E6065 | 88E6185 | 88E6352 | 88E6390 | 88E6390X |
          | ------- | ------- | ------- | ------- | ------- | -------- |
          | 10      | 0x0000  | 0x0000  | 0x0000  | 0x2000  | 0x2000   |
          | 100     | 0x0001  | 0x0001  | 0x0001  | 0x2001  | 0x2001   |
          | 200     | 0x0002  | NA      | 0x1001  | 0x3001  | 0x3001   |
          | 1000    | NA      | 0x0002  | 0x0002  | 0x2002  | 0x2002   |
          | 2500    | NA      | NA      | NA      | 0x3003  | 0x3003   |
          | 10000   | NA      | NA      | NA      | NA      | 0x2003   |
          | unforce | 0x0003  | 0x0003  | 0x0003  | 0x0000  | 0x0000   |
      
      This patch implements a generic mv88e6xxx_port_set_speed() function used
      by chip-specific wrappers to filter supported ports and speeds.
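
      The table above can be reproduced with a small encoder. This is a
      sketch derived only from the table, not the driver's function:
      speed_to_reg, SPEED_UNFORCED_SKETCH and the two flag parameters
      are hypothetical names. Here alt_bit means the chip uses bit 12
      for alternative speeds and force_bit means it uses bit 13.
      Filtering of unsupported (NA) speeds is left to the caller, as
      the message says chip-specific wrappers do.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      #define SPEED_UNFORCED_SKETCH  (-1)     /* hypothetical "unforce" request */
      #define REG_ALT_SPEED          0x1000u  /* bit 12 */
      #define REG_FORCE_SPEED        0x2000u  /* bit 13 */

      /* Return the register 0x1 speed value from the table,
       * or -1 for a speed the table does not list. */
      static int32_t speed_to_reg(int speed, bool alt_bit, bool force_bit)
      {
          uint16_t val;

          switch (speed) {
          case 10:    val = 0x0; break;
          case 100:   val = 0x1; break;
          case 200:   val = alt_bit ? (0x1 | REG_ALT_SPEED) : 0x2; break;
          case 1000:  val = 0x2; break;
          case 2500:  val = 0x3 | REG_ALT_SPEED; break;
          case 10000: val = 0x3; break;
          case SPEED_UNFORCED_SKETCH:
              /* Per the table, chips with a force bit unforce with 0x0000. */
              return force_bit ? 0x0000 : 0x0003;
          default:
              return -1;
          }

          if (force_bit)
              val |= REG_FORCE_SPEED;
          return val;
      }
      ```

      For example, the 88E6352 column corresponds to alt_bit set and
      force_bit clear, and the 88E6390 columns to both set.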
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port's RGMII delay setter · a0a0f622
      Vivien Didelot authored
      Some chips such as 88E6352 and 88E6390 can be programmed to add delays
      to RXCLK for IND inputs or to GTXCLK for OUTD outputs when port is in
      RGMII mode.
      
      Add a port function to program such delays according to the provided PHY
      interface mode.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port duplex setter · 7f1ae07b
      Vivien Didelot authored
      Similarly to the port's link, add a setter to force the port's half
      duplex, full duplex, or let normal duplex detection occur.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port link setter · 08ef7f10
      Vivien Didelot authored
      Most of the chips have port register control bits to force the
      port's link up, down, or let normal link detection occur.
      
      Implement such operation to use it later when setting duplex, etc.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port 802.1Q mode setter · 385a0995
      Vivien Didelot authored
      Add port functions to set the port 802.1Q mode.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port PVID accessors · 77064f37
      Vivien Didelot authored
      Add port functions to access the port's default VID.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port FID accessors · b4e48c50
      Vivien Didelot authored
      Add functions to port files to access the port's default FID.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port vlan map setter · 5a7921f4
      Vivien Didelot authored
      Add a port function to access the Port Based VLAN Map register.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port state setter · e28def33
      Vivien Didelot authored
      Add the port STP state setter to the port files.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add port files · 18abed21
      Vivien Didelot authored
      The Marvell switches contain one internal SMI device per port, called
      "Port Registers". Depending on the model, the addresses of these devices
      start from 0x0, 0x8 or 0x10.
      
      Start moving Port Registers specific code to their own files.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 03 Nov, 2016 12 commits
  3. 02 Nov, 2016 12 commits
    • enic: set skb->hash type properly · 17197236
      Govindarajulu Varadarajan authored
      The driver sets the skb L3/L4 hash type based on NIC_CFG_RSS_HASH_TYPE_*,
      which is a bit mask. This is wrong: the hardware actually provides an enum.
      Use CQ_ENET_RQ_DESC_RSS_TYPE_* to set the L3 and L4 hash type.
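
      To illustrate why a bit-mask test is wrong for an enumerated
      value, here is a minimal sketch. The enum names and values below
      are hypothetical and are not the driver's constants; the point is
      only that an enum must be compared by value, since for instance
      the value 3 has the bit of value 2 set without meaning the same
      thing.

      ```c
      /* Hypothetical per-packet RSS type as an enum (values illustrative). */
      enum rss_type_sketch {
          RSS_NONE = 0,
          RSS_IPV4 = 1,
          RSS_TCP_IPV4 = 2,
          RSS_IPV6 = 3,
          RSS_TCP_IPV6 = 4
      };

      enum hash_level_sketch { HASH_NONE, HASH_L3, HASH_L4 };

      /* Compare the enum by value instead of testing individual bits. */
      static enum hash_level_sketch hash_level_for(enum rss_type_sketch t)
      {
          switch (t) {
          case RSS_TCP_IPV4:
          case RSS_TCP_IPV6:
              return HASH_L4;
          case RSS_IPV4:
          case RSS_IPV6:
              return HASH_L3;
          default:
              return HASH_NONE;
          }
      }
      ```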
      
      Fixes: bf751ba8 ("driver/net: enic: record q_number and rss_hash for skb")
      Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: 3com: typhoon: use new api ethtool_{get|set}_link_ksettings · f7a5537c
      Philippe Reynes authored
      The ethtool api {get|set}_settings is deprecated.
      We move this driver to new api {get|set}_link_ksettings.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Reviewed-by: David Dillow <dave@thedillows.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ila: Fix crash caused by rhashtable changes · 1913540a
      Tom Herbert authored
      Commit ca26893f ("rhashtable: Add rhlist interface")
      added a field to rhashtable_iter so that its length became 56 bytes,
      exceeding the size of args in netlink_callback (which is
      48 bytes). The netlink diag dump function has already been
      allocating an iter structure and storing a pointer to it
      in the args of netlink_callback. ila_xlat also uses
      rhashtable_iter but still puts it directly in
      the arg block. Now that the rhashtable_iter size has increased,
      we write beyond the structure. The next field
      happens to be the cb_mutex pointer in netlink_sock, hence the crash.
      
      The fix is to allocate the rhashtable_iter and save a pointer to it
      in args.
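
      The pattern behind the fix can be sketched outside the kernel as
      follows. This is a minimal illustration with hypothetical types:
      big_iter stands in for a 56-byte iterator and dump_cb for a
      callback with a small fixed argument block. It uses intptr_t
      slots so that the pointer round trip is well defined in portable
      C, which is an assumption of this sketch rather than the kernel's
      layout.

      ```c
      #include <stdint.h>
      #include <stdlib.h>

      struct big_iter { char state[56]; };    /* hypothetical 56-byte iterator */
      struct dump_cb  { intptr_t args[6]; };  /* models the fixed argument block */

      /* Allocate the iterator and keep only a pointer in the argument block. */
      static int dump_start(struct dump_cb *cb)
      {
          struct big_iter *iter = calloc(1, sizeof(*iter));

          if (!iter)
              return -1;
          cb->args[0] = (intptr_t)iter;
          return 0;
      }

      /* Recover the iterator from the argument block. */
      static struct big_iter *dump_iter(const struct dump_cb *cb)
      {
          return (struct big_iter *)cb->args[0];
      }

      /* Free the iterator when the dump is finished. */
      static void dump_done(struct dump_cb *cb)
      {
          free(dump_iter(cb));
          cb->args[0] = 0;
      }
      ```

      Storing the pointer rather than the structure keeps the argument
      block usage independent of how large the iterator grows.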
      
      Tested:
      
        modprobe ila
        ./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
        ./ip ila list  # NO crash now
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect · 3de864f8
      Cyrill Gorcunov authored
      While preparing patches for killing raw sockets via the
      diag netlink interface, I noticed that my runs got stuck:
      
       | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
       | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
       | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
       | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
       | [<ffffffff8179b517>] raw_abort+0x33/0x42
       | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52
      
      which has not been the case before. I narrowed it down to the commit
      
       | commit 286c72de
       | Author: Eric Dumazet <edumazet@google.com>
       | Date:   Thu Oct 20 09:39:40 2016 -0700
       |
       |     udp: must lock the socket in udp_disconnect()
      
      where we started locking the socket for a different reason.
      
      So raw_abort escaped the renaming, and we have to
      fix this typo by using __udp_disconnect instead.
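
      The locked/unlocked function pair at issue here can be modelled
      with a toy lock. This is only a sketch: every name below is
      hypothetical, and an assert stands in for the hang that a real
      non-recursive lock would produce when taken twice.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      struct sock_model { bool locked; bool connected; int err; };

      static void model_lock(struct sock_model *sk)
      {
          assert(!sk->locked);  /* a real lock would block forever here */
          sk->locked = true;
      }

      static void model_unlock(struct sock_model *sk)
      {
          sk->locked = false;
      }

      /* Unlocked variant: the caller must already hold the lock. */
      static void model_disconnect_unlocked(struct sock_model *sk)
      {
          assert(sk->locked);
          sk->connected = false;
      }

      /* Locked variant: takes the lock itself. */
      static void model_disconnect(struct sock_model *sk)
      {
          model_lock(sk);
          model_disconnect_unlocked(sk);
          model_unlock(sk);
      }

      /* Abort path: it already holds the lock, so it must call the
       * unlocked variant; calling model_disconnect() here would
       * try to take the lock a second time. */
      static void model_abort(struct sock_model *sk)
      {
          model_lock(sk);
          sk->err = 1;  /* placeholder error marker */
          model_disconnect_unlocked(sk);
          model_unlock(sk);
      }
      ```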
      
      Fixes: 286c72de ("udp: must lock the socket in udp_disconnect()")
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lan78xx: Use irq_domain for phy interrupt from USB Int. EP · cc89c323
      Woojung Huh authored
      To fully utilize phylib with interrupts, rather than handling some of
      the PHY work in the MAC driver, create an irq_domain for the PHY
      interrupt delivered through the USB interrupt EP and pass the irq
      number to phy_connect_direct() instead of PHY_IGNORE_INTERRUPT.
      
      The idea comes from drivers/gpio/gpio-dl2.c
      Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: enhance tcp collapsing · 2331ccc5
      Eric Dumazet authored
      As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
      time even if the skb at the right has fragments.
      
      We simply have to use the more generic skb_copy_bits() instead of
      skb_copy_from_linear_data() in tcp_collapse_retrans().
      
      We also need to guard this skb_copy_bits() in case there is nothing to
      copy, otherwise skb_put() could panic if the left skb has frags.
      
      Tested:
      
      Used the following packetdrill test:
      
      // Establish a connection.
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      +.100 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
         +0 write(4, ..., 200) = 200
         +0 > P. 1:201(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 201:401(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 401:601(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 601:801(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 801:1001(200) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1001:1101(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1101:1201(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1201:1301(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1301:1401(100) ack 1
      
      +.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
      // Check that TCP collapse works :
         +0 > P. 1:1001(1000) ack 1
      Reported-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: 3c509: use new api ethtool_{get|set}_link_ksettings · b646cf29
      Philippe Reynes authored
      The ethtool api {get|set}_settings is deprecated.
      We move this driver to new api {get|set}_link_ksettings.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: 3c59x: use new api ethtool_{get|set}_link_ksettings · e19b7883
      Philippe Reynes authored
      The ethtool api {get|set}_settings is deprecated.
      We move this driver to new api {get|set}_link_ksettings.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mii: add generic function to support ksetting support · bc8ee596
      Philippe Reynes authored
      The old ethtool API (get_settings and set_settings) has generic mii
      functions mii_ethtool_sset and mii_ethtool_gset.
      
      To support the new ethtool API ({get|set}_link_ksettings), we add
      two generic mii functions, mii_ethtool_{get|set}_link_ksettings_get.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlx4-XDP-tx-refactor' · 55454e9e
      David S. Miller authored
      Tariq Toukan says:
      
      ====================
      mlx4 XDP TX refactor
      
      This patchset refactors the XDP forwarding case, so that
      its dedicated transmit queues are managed completely
      separately from the other regular ones.
      
      It also adds ethtool counters for XDP cases.
      
      Series generated against net-next commit:
      22ca904a genetlink: fix error return code in genl_register_family()
      
      Thanks,
      Tariq.
      
      v3:
      * Exposed per ring counters.
      
      v2:
      * Added ethtool counters.
      * Rebased, now patch 2 reverts Brenden's fix, as the bug no longer exists:
        958b3d39 ("net/mlx4_en: fixup xdp tx irq to match rx")
      * Updated commit message of patch 2.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Add ethtool statistics for XDP cases · 15fca2c8
      Tariq Toukan authored
      XDP statistics are reported in ethtool, in total and per ring,
      as follows:
      - xdp_drop: the number of packets dropped by xdp.
      - xdp_tx: the number of packets forwarded by xdp.
      - xdp_tx_full: the number of times an xdp forward failed
      	due to a full tx xdp ring.
      
      In addition, all packets that are dropped/forwarded by XDP
      are no longer accounted in rx_packets/rx_bytes of the ring,
      so that they count traffic that is passed to the stack.
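
      The accounting rule described above can be sketched as follows.
      This is an illustrative model, not the driver's receive path:
      ring_stats, xdp_verdict_sketch and account_rx are hypothetical
      names, and whether a failed forward is also counted elsewhere is
      not stated in this message, so the sketch only bumps xdp_tx_full
      in that case.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      enum xdp_verdict_sketch { VERDICT_PASS, VERDICT_DROP, VERDICT_TX };

      struct ring_stats {
          uint64_t rx_packets;
          uint64_t rx_bytes;
          uint64_t xdp_drop;
          uint64_t xdp_tx;
          uint64_t xdp_tx_full;
      };

      /* Account one received packet of len bytes on an RX ring. */
      static void account_rx(struct ring_stats *s, enum xdp_verdict_sketch v,
                             uint32_t len, bool tx_ring_full)
      {
          switch (v) {
          case VERDICT_DROP:
              s->xdp_drop++;
              break;
          case VERDICT_TX:
              if (tx_ring_full)
                  s->xdp_tx_full++;
              else
                  s->xdp_tx++;
              break;
          case VERDICT_PASS:
              /* Only traffic passed to the stack counts as rx. */
              s->rx_packets++;
              s->rx_bytes += len;
              break;
          }
      }
      ```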
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Refactor the XDP forwarding rings scheme · 67f8b1dc
      Tariq Toukan authored
      Separately manage the two types of TX rings: regular ones, and XDP.
      Upon an XDP set, do not borrow regular TX rings and convert them
      into XDP ones; allocate new ones instead, unless we hit the max
      number of rings.
      This means that on systems with fewer cores we will not consume
      the current TX rings for XDP while we are still within the TX ring limit.
      
      XDP TX rings counters are not shown in ethtool statistics.
      Instead, XDP counters will be added to the respective RX rings
      in a downstream patch.
      
      This has no performance implications.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>