1. 01 May, 2018 40 commits
    • tcp: send in-queue bytes in cmsg upon read · b75eba76
      Soheil Hassas Yeganeh authored
      Applications with many concurrent connections, high variance
      in receive queue length and tight memory bounds cannot
      allocate worst-case buffer size to drain sockets. Knowing
      the receive queue length, applications can optimize
      how they allocate buffers to read from the socket.
      
      The number of bytes pending on the socket is directly
      available through ioctl(FIONREAD/SIOCINQ) and can be
      approximated using getsockopt(MEMINFO) (rmem_alloc includes
      skb overheads in addition to application data). But, both of
      these options add an extra syscall per recvmsg. Moreover,
      ioctl(FIONREAD/SIOCINQ) takes the socket lock.
      
      Add the TCP_INQ socket option to TCP. When this socket
      option is set, recvmsg() relays the number of bytes available
      on the socket for reading to the application via the
      TCP_CM_INQ control message.
      
      Calculate the number of bytes after releasing the socket lock
      to include the processed backlog, if any. To avoid an extra
      branch in the hot path of recvmsg() for this new control
      message, move all cmsg processing inside an existing branch for
      processing receive timestamps. Since the socket lock is not held
      when calculating the size of receive queue, TCP_INQ is a hint.
      For example, it can overestimate the queue size by one byte,
      if FIN is received.
      
      With this method, applications can start reading from the socket
      using a small buffer, and then use larger buffers based on the
      remaining data when needed.
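
A minimal userspace sketch of the read path described above, written in Python for brevity. It assumes a Linux kernel that carries this patch. The numeric value 36 for TCP_INQ and TCP_CM_INQ comes from the Linux UAPI header linux/tcp.h; the helper names `enable_inq`, `parse_inq` and `recv_with_inq` are hypothetical.

```python
import socket
import struct

# Value from the Linux UAPI header linux/tcp.h; TCP_CM_INQ is defined as TCP_INQ.
TCP_INQ = 36
TCP_CM_INQ = TCP_INQ
INT_SIZE = struct.calcsize("i")


def enable_inq(sock):
    """Ask the kernel to attach a TCP_CM_INQ cmsg to recvmsg() on this socket."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_INQ, 1)


def parse_inq(ancdata):
    """Return the in-queue byte count from recvmsg() ancillary data, or None."""
    for level, ctype, data in ancdata:
        if level == socket.IPPROTO_TCP and ctype == TCP_CM_INQ:
            # The count is carried as a native C int.
            return struct.unpack("i", data[:INT_SIZE])[0]
    return None


def recv_with_inq(sock, bufsize=256):
    """Read with a small buffer; return (data, hint of bytes still queued)."""
    data, ancdata, _flags, _addr = sock.recvmsg(bufsize, socket.CMSG_SPACE(INT_SIZE))
    return data, parse_inq(ancdata)
```

A caller could use the returned hint to size the next read, treating it as approximate because it may overestimate by one byte when a FIN has been received.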
      
      V3 change-log:
      	As suggested by David Miller, added loads with barrier
      	to check whether we have multiple threads calling recvmsg
      	in parallel. When that happens we lock the socket to
      	calculate inq.
      V4 change-log:
      	Removed inline from a static function.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Suggested-by: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'hns3-fixes' · ab85539e
      David S. Miller authored
      Salil Mehta says:
      
      ====================
      Misc bug fixes for HNS3 Ethernet driver
      
      This patch-set presents some miscellaneous bug fixes and cleanups for
      the HNS3 Ethernet driver.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: Remove packet statistics in the range of 8192~12287 · dbecc779
      Xi Wang authored
      The current statistics for the 8192~12287 size range are valid only for GE,
      while the 8192~9216 and 9217~12287 ranges are valid only for LGE/CGE and are
      always 0 for GE interfaces. This easily causes confusion when viewing the
      packet statistics with ethtool -S.
      
      This patch removes the 8192~12287 range of packet statistics and uses the
      8192~9216 and 9217~12287 ranges for statistics. This change depends on the
      firmware upgrade.
      Signed-off-by: Xi Wang <wangxi11@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: Fix for packet loss due wrong filter config in VLAN tbls · dc8131d8
      Yunsheng Lin authored
      There are two levels of VLAN tables in hardware: the port VLAN
      table, which is shared by all functions, and the function
      VLAN table, of which each function has its own.
      Currently, the PF sets the port VLAN table and the VF sets the function
      VLAN table, which causes packet loss.
      
      This patch fixes the problem by setting both VLAN tables and
      using hdev->vlan_table to manage the port VLAN table.
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix a dead loop in hclge_cmd_csq_clean · 3ff50490
      Huazhong Tan authored
      If head has an invalid value, a dead loop can be triggered in
      hclge_cmd_csq_clean. This patch adds a sanity check for this case.
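
To illustrate why an unchecked head can loop forever, here is a small Python model of cleaning a command ring. It is only a sketch under assumptions: the names `next_to_clean`, `next_to_use` and `desc_num` and the exact validity rule are illustrative and are not taken from the driver source.

```python
def is_valid_clean_head(next_to_clean, next_to_use, head, desc_num):
    """True if head lies between next_to_clean and next_to_use in ring order."""
    if not 0 <= head < desc_num:
        return False
    pending = (next_to_use - next_to_clean) % desc_num
    return (head - next_to_clean) % desc_num <= pending


def csq_clean(next_to_clean, next_to_use, head, desc_num):
    """Advance next_to_clean up to head; return (cleaned, new_next_to_clean)."""
    # Without this check, a head outside the ring never equals next_to_clean,
    # so the loop below would spin forever.
    if not is_valid_clean_head(next_to_clean, next_to_use, head, desc_num):
        raise ValueError(f"invalid head {head}")
    cleaned = 0
    while head != next_to_clean:
        next_to_clean = (next_to_clean + 1) % desc_num
        cleaned += 1
    return cleaned, next_to_clean
```

In this model a head that is inside the ring but beyond the issued descriptors is also rejected, since cleaning up to it would touch descriptors that were never submitted.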
      
      Fixes: 68c0a5c7 ("net: hns3: Add HNS3 IMP(Integrated Mgmt Proc) Cmd
      Interface Support")
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: Fix to support autoneg only for port attached with phy · 0c963e8c
      Fuyun Liang authored
      This patch adds a check to support autoneg(ethtool -A) only when PHY
      is attached with the port.
      
      Fixes: e2cb1dec ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility
      Layer) Support")
      Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix for phy_addr error in hclge_mac_mdio_config · c5ef83cb
      Huazhong Tan authored
      When a PHY exists, phy_addr must be less than PHY_MAX_ADDR.
      If not, hclge_mac_mdio_config should return an error.
      For fiber (phy_addr=0xff), hclge_mac_mdio_config is not needed.
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: Fixes the error legs in hclge_init_ae_dev function · ffd5656e
      Huazhong Tan authored
      This patch fixes some of the missed error legs in the initialization
      function of the ae device. This might cause leaks in case of failure.
      
      Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer
      Support")
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: Fixes the out of bounds access in hclge_map_tqp · 38e62046
      Huazhong Tan authored
      This patch fixes the handling of the check when the number of vports
      is detected to be more than the available TQPs. The current handling causes
      an out of bounds access in hclge_map_tqp().
      
      Fixes: 7df7dad6 ("net: hns3: Refactor the mapping of tqp to vport")
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix to correctly fetch l4 protocol outer header · 35f58fd7
      Huazhong Tan authored
      This patch fixes the function being used to fetch L4
      protocol outer header. Mistakenly skb_inner_transport_header
      API was being used earlier.
      
      Fixes: 76ad4f0e ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: Remove error log when getting pfc stats fails · 20670328
      Yunsheng Lin authored
      When the MAC supports DCB but is in GE mode, it does not support
      querying PFC stats, so the firmware returns an error when trying to
      query them. This creates a lot of noise in the kernel
      log when it prints the error log.
      
      This patch fixes it by removing the error log. The error is already
      returned to user space, so the user should be aware of
      it.
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • connector: add parent pid and tgid to coredump and exit events · b086ff87
      Stefan Strogin authored
      The intention is to let a listener (e.g. a process manager) be notified
      of process failures as soon as possible, before a possible core dump
      (which could take very long). Coredump and exit process events
      are perfect for such use cases (see 2b5faa4c "connector: Added
      coredumping event to the process connector").
      
      The problem is that, for now, the process manager cannot learn the parent
      of a dying process through connectors. This would be useful if the
      process manager should only monitor failures of children of certain
      parents, so that the coredump and exit events can be filtered by parent
      process and/or thread ID.
      
      Add parent pid and tgid to coredump and exit process connectors event
      data.
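
The kind of filtering this enables can be sketched as follows. The `ExitEvent` class below is a hypothetical, already-decoded view of an event and does not reproduce the kernel's event layout; only the idea of carrying the parent pid and tgid alongside the process ids comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class ExitEvent:
    # Hypothetical decoded event; field names are illustrative.
    process_pid: int
    process_tgid: int
    exit_code: int
    parent_pid: int
    parent_tgid: int


def events_for_parents(events, watched_parent_tgids):
    """Keep only events whose dying process is a child of a watched parent."""
    watched = set(watched_parent_tgids)
    return [event for event in events if event.parent_tgid in watched]
```

A process manager would feed decoded connector events through such a filter and ignore failures of processes it does not supervise.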
      Signed-off-by: Stefan Strogin <sstrogin@cisco.com>
      Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: Inline netdev_features_size_check() · e283de3a
      Florian Fainelli authored
      We do not require this inline function to be used in multiple different
      locations, just inline it where it gets used in register_netdevice().
      Suggested-by: David Miller <davem@davemloft.net>
      Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: disable gso with no_check_tx · a8c744a8
      Willem de Bruijn authored
      Syzbot managed to send a udp gso packet without checksum offload into
      the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
      triggered the skb_warn_bad_offload.
      
        RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
         skb_gso_segment include/linux/netdevice.h:4038 [inline]
         validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
         __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
         dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
      
      UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
      udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
      catch and fail in this case.
      
      After the integrity checks jump directly to the CHECKSUM_PARTIAL case
      to avoid reading the no_check_tx flags again (a TOCTTOU race).
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cls_flower: Support multiple masks per priority · 05cd271f
      Paul Blakey authored
      Currently flower doesn't support inserting filters with different masks
      on a single priority, even if the actual flows (key + mask) inserted
      aren't overlapping, as with the use case of offloading openvswitch
      datapath flows. Instead one must go up one level and assign a different
      priority to each mask, which creates a separate flower
      instance per mask.
      
      This patch opens flower to support more than one mask per priority,
      and a single flower instance. It does so by adding another hash table
      on top of the existing one which will store the different masks,
      and the filters that share it.
      
      The user is left with the responsibility of ensuring non-overlapping
      flows, otherwise precedence is not guaranteed.
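
The two-level lookup described above can be modeled roughly as below. This is a sketch, not the kernel data structure: keys and masks are plain integers, and the class name `MultiMaskTable` is hypothetical.

```python
class MultiMaskTable:
    """Outer table keyed by mask; each mask owns a table of masked keys."""

    def __init__(self):
        self.masks = {}  # mask -> {masked key -> action}

    def insert(self, key, mask, action):
        # Filters that share a mask land in the same inner table.
        self.masks.setdefault(mask, {})[key & mask] = action

    def classify(self, packet_key):
        # Try every mask; the first hit wins, so overlapping flows
        # have no guaranteed precedence.
        for mask, filters in self.masks.items():
            action = filters.get(packet_key & mask)
            if action is not None:
                return action
        return None
```

For example, a full-match filter and a prefix filter can coexist in one table as long as no packet matches both.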
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sctp-unify-sctp_make_op_error_fixed-and-sctp_make_op_error_space' · 9908b363
      David S. Miller authored
      Marcelo Ricardo Leitner says:
      
      ====================
      sctp: unify sctp_make_op_error_fixed and sctp_make_op_error_space
      
      These two variants are very close to each other and can be merged
      to avoid code duplication. That's what this patchset does.
      
      First, we allow sctp_init_cause to return errors, which then allow us to
      add sctp_make_op_error_limited that handles both situations.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add sctp_make_op_error_limited and reuse inner functions · 8914f4ba
      Marcelo Ricardo Leitner authored
      The idea is quite similar to the old functions, but note that the _fixed
      function wasn't "fixed" as in that it would generate a packet with a fixed
      size, but rather limited/bounded to PMTU.
      
      Also, now with sctp_mtu_payload(), we have a more accurate limit.
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: allow sctp_init_cause to return errors · 6d3e8aa8
      Marcelo Ricardo Leitner authored
      And do so if the skb doesn't have enough space for the payload.
      This is a preparation for the next patch.
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-stmmac-dwmac-meson-100M-phy-mode-support-for-AXG-SoC' · 065662d9
      David S. Miller authored
      Yixun Lan says:
      
      ====================
      net: stmmac: dwmac-meson: 100M phy mode support for AXG SoC
      
      Because the dwmac glue layer register has changed, we need to
      introduce a new compatible name for the Meson-AXG SoC
      to support the RMII 100M ethernet PHY.
      
      Change since v1 at [1]:
        - implement set_phy_mode() for each SoC
      
      [1] https://lkml.kernel.org/r/20180426160508.29380-1-yixun.lan@amlogic.com
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: dwmac-meson: extend phy mode setting · efacb568
      Yixun Lan authored
      In the Meson-AXG SoC, the phy mode setting of PRG_ETH0 in the glue layer
      is extended from bit[0] to bit[2:0].
        There is no problem if we configure it to the RGMII 1000M PHY mode,
      since the register setting is coincidentally compatible with the previous one,
      but for the RMII 100M PHY mode, the configuration needs to be changed to
      the value b100.
        This patch was verified with an RTL8201F 100M ethernet PHY.
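
The register update can be pictured with the short Python sketch below. The RMII value b100 and the bit[2:0] field width come from the commit message; the RGMII value b001 is an assumption inferred from the statement that the setting stays compatible with the old bit[0] layout, and all constant and function names are hypothetical.

```python
PHY_MODE_MASK = 0b111   # bit[2:0] of PRG_ETH0 on Meson-AXG
PHY_MODE_RGMII = 0b001  # assumed: same pattern as the old bit[0] setting
PHY_MODE_RMII = 0b100   # value given in the commit message


def set_phy_mode(prg_eth0, mode):
    """Return the 32-bit register value with bit[2:0] replaced by mode."""
    cleared = prg_eth0 & ~PHY_MODE_MASK & 0xFFFFFFFF
    return cleared | (mode & PHY_MODE_MASK)
```

Clearing the whole three-bit field before writing matters here, because switching from RGMII to RMII must also reset bit 0.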
      Signed-off-by: Yixun Lan <yixun.lan@amlogic.com>
      Acked-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: meson-dwmac: new compatible name for AXG SoC · 7e5d05e1
      Yixun Lan authored
      We need to introduce a new compatible name for the Meson-AXG SoC
      in order to support the RMII 100M ethernet PHY, since the PRG_ETH0
      register of the dwmac glue layer has changed from previous SoCs.
      Signed-off-by: Yixun Lan <yixun.lan@amlogic.com>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netns-uevent-filtering' · 90d52d4f
      David S. Miller authored
      Christian Brauner says:
      
      ====================
      netns: uevent filtering
      
      This is the new approach to uevent filtering as discussed (see the
      threads in [1], [2], and [3]). It only contains *non-functional
      changes*.
      
      This series deals with fixing up uevent filtering logic:
      - uevent filtering logic is simplified
      - locking time on uevent_sock_list is minimized
      - tagged and untagged kobjects are handled in separate codepaths
      - permissions for userspace are fixed for network device uevents in
        network namespaces owned by non-initial user namespaces
        Udev is now able to see those events correctly which it wasn't before.
        For example, moving a physical device into a network namespace not
        owned by the initial user namespace previously gave:
      
        root@xen1:~# udevadm --debug monitor -k
        calling: monitor
        monitor will print the received events for:
        KERNEL - the kernel uevent
      
        sender uid=65534, message ignored
        sender uid=65534, message ignored
        sender uid=65534, message ignored
        sender uid=65534, message ignored
        sender uid=65534, message ignored
      
        and now after the discussion and solution in [3] correctly gives:
      
        root@xen1:~# udevadm --debug monitor -k
        calling: monitor
        monitor will print the received events for:
        KERNEL - the kernel uevent
      
        KERNEL[625.301042] add      /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
        KERNEL[625.301109] move     /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
        KERNEL[625.301138] move     /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
        KERNEL[655.333272] remove /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
      
      Thanks!
      Christian
      
      [1]: https://lkml.org/lkml/2018/4/4/739
      [2]: https://lkml.org/lkml/2018/4/26/767
      [3]: https://lkml.org/lkml/2018/4/26/738
      ====================
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: restrict uevents · a3498436
      Christian Brauner authored
      Commit 07e98962 ("kobject: Send hotplug events in all network namespaces")
      enabled sending hotplug events into all network namespaces back in 2010.
      Over time the set of uevents that get sent into all network namespaces has
      shrunk. We have now reached the point where hotplug events for all devices
      that carry a namespace tag are filtered according to that namespace.
      Specifically, they are filtered whenever the namespace tag of the kobject
      does not match the namespace tag of the netlink socket.
      Currently, only network devices carry namespace tags (i.e. network
      namespace tags). Hence, uevents for network devices only show up in the
      network namespace such devices are created in or moved to.
      
      However, any uevent for a kobject that does not have a namespace tag
      associated with it will not be filtered and we will broadcast it into all
      network namespaces. This behavior stopped making sense when user namespaces
      were introduced.
      
      This patch simplifies and fixes a couple of things:
      - Split codepath for sending uevents by kobject namespace tags:
        1. Untagged kobjects - uevent_net_broadcast_untagged():
           Untagged kobjects will be broadcast into all uevent sockets recorded
           in uevent_sock_list, i.e. into all network namespaces owned by the
           initial user namespace.
        2. Tagged kobjects - uevent_net_broadcast_tagged():
           Tagged kobjects will only be broadcast into the network namespace they
           were tagged with.
        Handling of tagged kobjects in 2. does not cause any semantic changes.
        This is just splitting out the filtering logic that was handled by
        kobj_bcast_filter() before.
        Handling of untagged kobjects in 1. will cause a semantic change. The
        reasons why this is needed and ok have been discussed in [1]. Here is a
        short summary:
        - Userspace ignores uevents from network namespaces that are not owned by
          the initial user namespace:
          Uevents are filtered by userspace in a user namespace because the
          received uid != 0. Instead the uid associated with the event will be
          65534 == "nobody" because the global root uid is not mapped.
          This means we can safely and without introducing regressions modify the
          kernel to not send uevents into all network namespaces whose owning
          user namespace is not the initial user namespace because we know that
          userspace will ignore the message because of the uid anyway.
          I have verified a) that this is true for every udev implementation out
          there and b) that this behavior has been present in all udev
          implementations from the very beginning.
        - Thundering herd:
          Broadcasting uevents into all network namespaces introduces significant
          overhead.
          All processes that listen to uevents running in non-initial user
          namespaces will end up responding to uevents that will be meaningless
          to them. Mainly, because non-initial user namespaces cannot easily
          manage devices unless they have a privileged host-process helping them
          out. This means that there will be a thundering herd of activity when
          there shouldn't be any.
        - Removing needless overhead/Increasing performance:
          Currently, the uevent socket for each network namespace is added to the
          global variable uevent_sock_list. The list itself needs to be protected
          by a mutex. So every time a uevent is generated the mutex is taken on
          the list. The mutex is held *from the creation of the uevent (memory
          allocation, string creation etc.) until all uevent sockets have been
          handled*. This is aggravated by the fact that for each uevent socket
          that has listeners the mc_list must be walked as well which means we're
          talking O(n^2) here. Given that a standard Linux workload usually has
          quite a lot of network namespaces and - in the face of containers - a
          lot of user namespaces this quickly becomes a performance problem (see
          "Thundering herd" above). By just recording uevent sockets of network
          namespaces that are owned by the initial user namespace we
          significantly increase performance in this codepath.
        - Injecting uevents:
          There's a valid argument that containers might be interested in
          receiving device events especially if they are delegated to them by a
          privileged userspace process. One prime example is SR-IOV enabled
          devices that are explicitly designed to be handed off to other users
          such as VMs or containers.
          This use-case can now be correctly handled since
          commit 692ec06d ("netns: send uevent messages"). This commit
          introduced the ability to send uevents from userspace. As such we can
          let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
          namespace of the network namespace of the netlink socket) userspace
          process make a decision what uevents should be sent. This removes the
          need to blindly broadcast uevents into all user namespaces and provides
          a performant and safe solution to this problem.
        - Filtering logic:
          This patch filters by *owning user namespace of the network namespace a
          given task resides in* and not by user namespace of the task per se.
          This means if the user namespace of a given task is unshared but the
          network namespace is kept and is owned by the initial user namespace a
          listener that is opening the uevent socket in that network namespace
          can still listen to uevents.
      - Fix permission for tagged kobjects:
        Network devices that are created in or moved into a network namespace that
        is owned by a non-initial user namespace are currently sent with
        INVALID_{G,U}ID in their credentials. This means that all current udev
        implementations in userspace will ignore the uevent they receive for
        them. This has led to weird bugs whereby new devices showing up in such
        network namespaces were not recognized and did not get IPs assigned etc.
        This patch adjusts the permission to the appropriate {g,u}id in the
        respective user namespace. This way udevd is able to correctly handle
        such devices.
      - Simplify filtering logic:
        do_one_broadcast() already ensures that only listeners in mc_list receive
        uevents that have the same network namespace as the uevent socket itself.
        So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
        patch therefore removes kobj_bcast_filter() and replaces
        netlink_broadcast_filtered() with the simpler netlink_broadcast()
        everywhere.
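
The split between tagged and untagged kobjects can be modeled with a small Python sketch. It is a toy model under assumptions: namespaces are plain objects, `UeventCore` and its methods are hypothetical stand-ins for the two broadcast helpers named above, and netlink delivery is reduced to appending to a list.

```python
from dataclasses import dataclass, field

INIT_USER_NS = "init_user_ns"


@dataclass
class NetNS:
    name: str
    owner_user_ns: str
    received: list = field(default_factory=list)


class UeventCore:
    def __init__(self):
        self.uevent_sock_list = []

    def register_netns(self, ns):
        # Only record sockets of network namespaces that are owned by the
        # initial user namespace; others never see untagged uevents.
        if ns.owner_user_ns == INIT_USER_NS:
            self.uevent_sock_list.append(ns)

    def broadcast_untagged(self, event):
        for ns in self.uevent_sock_list:
            ns.received.append(event)

    def broadcast_tagged(self, event, tag_ns):
        # A tagged kobject is delivered only to the namespace in its tag.
        tag_ns.received.append(event)
```

In this model a container's network namespace still receives events for its own network devices, while host-wide events stay in namespaces owned by the initial user namespace.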
      
      [1]: https://lkml.org/lkml/2018/4/4/739
      [2]: https://lkml.org/lkml/2018/4/26/767
      [3]: https://lkml.org/lkml/2018/4/26/738
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • uevent: add alloc_uevent_skb() helper · 26045a7b
      Christian Brauner authored
      This patch adds alloc_uevent_skb() in preparation for follow up patches.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tls-offload-netdev-and-mlx5-support' · e33200bc
      David S. Miller authored
      Boris Pismenny says:
      
      ====================
      TLS offload, netdev & MLX5 support
      
      The following series provides TLS TX inline crypto offload.
      
      v1->v2:
         - Added IS_ENABLED(CONFIG_TLS_DEVICE) and a STATIC_KEY for icsk_clean_acked
         - File license fix
         - Fix spelling, comment by DaveW
         - Move memory allocations out of tls_set_device_offload and other misc fixes,
      	comments by Kiril.
      
      v2->v3:
         - Reversed xmas tree where needed and style fixes
         - Removed the need for skb_page_frag_refill, per Eric's comment
         - IPv6 dependency fixes
      
      v3->v4:
         - Remove "inline" from functions in C files
         - Make clean_acked_data_enabled a static variable and add enable/disable functions to control it.
         - Remove unnecessary variable initialization mentioned by ShannonN
         - Rebase over TLS RX
         - Refactor the tls_software_fallback to reduce the number of variables mentioned by KirilT
      
      v4->v5:
         - Add missing CONFIG_TLS_DEVICE
      
      v5->v6:
         - Move changes to the software implementation into a separate patch
         - Fix some checkpatch warnings
         - GPL export the enable/disable clean_acked_data functions
      
      v6->v7:
         - Use the dst_entry to obtain the netdev in dev_get_by_index
         - Remove the IPv6 patch since it is redundant now
      
      v7->v8:
         - Fix a merge conflict in mlx5 header
      
      v8->v9:
         - Fix false -Wmaybe-uninitialized warning
         - Fix empty space in the end of new files
      
      v9->v10:
         - Remove default "n" in net/Kconfig
      
      This series adds a generic infrastructure to offload TLS crypto to a
      network device. It enables the kernel TLS socket to skip encryption and
      authentication operations on the transmit side of the data path, leaving
      those computationally expensive operations to the NIC.
      
      The NIC offload infrastructure builds TLS records and pushes them to the
      TCP layer just like the SW KTLS implementation and using the same API.
      TCP segmentation is mostly unaffected. Currently the only exception is
      that we prevent mixed SKBs where only part of the payload requires
      offload. In the future we are likely to add a similar restriction
      following a change cipher spec record.
      
      The notable differences between SW KTLS and NIC offloaded TLS
      implementations are as follows:
      1. The offloaded implementation builds "plaintext TLS records", which
      contain plaintext instead of ciphertext and placeholder bytes
      instead of authentication tags.
      2. The offloaded implementation maintains a mapping from TCP sequence
      number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
      TLS socket, we can use the tls NIC offload infrastructure to obtain
      enough context to encrypt the payload of the SKB.
      A TLS record is released when the last byte of the record is ack'ed,
      this is done through the new icsk_clean_acked callback.
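
A rough Python model of the sequence-number-to-record mapping in item 2 is sketched below. It assumes records are appended in stream order and ignores 32-bit sequence number wraparound; the class `TlsRecordMap` and its method names are hypothetical.

```python
import bisect


class TlsRecordMap:
    """Maps TCP sequence numbers to the TLS record that contains them."""

    def __init__(self):
        self.starts = []  # record start sequence numbers, ascending
        self.ends = []    # matching end sequence numbers (exclusive)

    def add_record(self, start_seq, length):
        self.starts.append(start_seq)
        self.ends.append(start_seq + length)

    def lookup(self, seq):
        """Return (start, end) of the record containing seq, or None."""
        i = bisect.bisect_right(self.starts, seq) - 1
        if i >= 0 and seq < self.ends[i]:
            return (self.starts[i], self.ends[i])
        return None

    def clean_acked(self, acked_seq):
        """Release records whose last byte lies below acked_seq."""
        n = bisect.bisect_right(self.ends, acked_seq)
        del self.starts[:n]
        del self.ends[:n]
```

Here `clean_acked` plays the role of the acknowledgment callback: a record stays available for retransmission until every byte in it has been acknowledged.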
      
      The infrastructure should be extendable to support various NIC offload
      implementations.  However it is currently written with the
      implementation below in mind:
      The NIC assumes that packets from each offloaded stream are sent as
      plaintext and in-order. It keeps track of the TLS records in the TCP
      stream. When a packet marked for offload is transmitted, the NIC
      encrypts the payload in-place and puts authentication tags in the
      relevant place holders.
      
      The responsibility for handling out-of-order packets (i.e. TCP
      retransmission, qdisc drops) falls on the netdev driver.
      
      The netdev driver keeps track of the expected TCP SN from the NIC's
      perspective.  If the next packet to transmit matches the expected TCP
      SN, the driver advances the expected TCP SN, and transmits the packet
      with TLS offload indication.
      
      If the next packet to transmit does not match the expected TCP SN, the
      driver calls the TLS layer to obtain the TLS record that includes the
      TCP sequence number of the packet for transmission. Using this TLS
      record, the driver posts a work entry on the transmit queue to
      reconstruct the NIC TLS state required for the offload of the
      out-of-order packet. It updates the expected TCP SN accordingly and
      transmits the now in-order packet.
      The same queue is used for packet transmission and TLS context
      reconstruction to avoid the need for flushing the transmit queue before
      issuing the context reconstruction request.
      
      The expected TCP SN is accessed without a lock, under the assumption
      that TCP doesn't transmit SKBs from different TX queues concurrently.
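
      The in-order check and resync decision described above can be
      sketched in plain C as follows. This is a minimal user-space
      illustration, not the mlx5e code: struct tx_ctx, enum tx_action and
      tx_handle() are hypothetical names, and the record lookup and the
      work entry posted on the transmit queue are reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of the driver's per-packet decision. */
enum tx_action {
    TX_OFFLOAD,          /* in order: transmit with TLS offload indication */
    TX_RESYNC_THEN_SEND, /* out of order: reconstruct NIC state, then send */
};

/* Hypothetical per-stream driver state. */
struct tx_ctx {
    uint32_t expected_seq; /* next TCP SN from the NIC's perspective */
    unsigned resyncs;      /* number of context reconstructions posted */
};

/* Decide how to transmit a packet that starts at seq and carries
 * payload_len bytes, then advance the expected sequence number. */
static enum tx_action tx_handle(struct tx_ctx *ctx, uint32_t seq,
                                uint32_t payload_len)
{
    enum tx_action action = TX_OFFLOAD;

    assert(ctx);
    if (seq != ctx->expected_seq) {
        /* A real driver would fetch the TLS record covering seq and
         * post a reconstruction work entry on the same TX queue. */
        ctx->resyncs++;
        action = TX_RESYNC_THEN_SEND;
    }
    /* After a resync the packet counts as in order again. */
    ctx->expected_seq = seq + payload_len;
    return action;
}
```

      Because the reconstruction request and the packet share one queue in
      the description above, the sketch needs no flush step between the
      two; it only records that a resync happened before the send.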
      
      If packets are rerouted to a different netdevice, then a software
      fallback routine handles encryption.
      
      Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e33200bc
    • Boris Pismenny's avatar
      f9c8141f
    • Boris Pismenny's avatar
    • Ilya Lesokhin's avatar
      net/mlx5e: TLS, Add error statistics · 43585a41
      Ilya Lesokhin authored
      Add statistics for rare TLS-related errors.
      Since the errors are rare, we have a counter per netdev
      rather than per SQ.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      43585a41
    • Ilya Lesokhin's avatar
      net/mlx5e: TLS, Add Innova TLS TX offload data path · bf239741
      Ilya Lesokhin authored
      Implement the TLS tx offload data path according to the
      requirements of the TLS generic NIC offload infrastructure.
      
      Special metadata ethertype is used to pass information to
      the hardware.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf239741
    • Ilya Lesokhin's avatar
      net/mlx5e: TLS, Add Innova TLS TX support · c83294b9
      Ilya Lesokhin authored
      Add NETIF_F_HW_TLS_TX capability and expose tlsdev_ops to work with the
      TLS generic NIC offload infrastructure.
      The NETIF_F_HW_TLS_TX capability will be added in the next patch.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c83294b9
    • Ilya Lesokhin's avatar
      net/mlx5: Accel, Add TLS tx offload interface · 1ae17322
      Ilya Lesokhin authored
      Add routines for manipulating TLS TX offload contexts.
      
      In Innova TLS, TLS contexts are added or deleted
      via a command message over the SBU connection.
      The HW then sends a response message over the same connection.
      
      Add implementation for Innova TLS (FPGA-based) hardware.
      
      These routines will be used by the TLS offload support in a later patch.
      
      mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
      to work directly with mlx5_core rather than Innova FPGA or other mlx5
      acceleration providers.
      
      In the future, when IPSec/TLS or any other acceleration gets integrated
      into the ConnectX chip, the mlx5/accel layer will provide the
      integrated acceleration, rather than the Innova one.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ae17322
    • Ilya Lesokhin's avatar
      net/mlx5e: Move defines out of ipsec code · bb909416
      Ilya Lesokhin authored
      The defines are not IPSEC specific.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb909416
    • Ilya Lesokhin's avatar
      net/tls: Add generic NIC offload infrastructure · e8f69799
      Ilya Lesokhin authored
      This patch adds a generic infrastructure to offload TLS crypto to a
      network device. It enables the kernel TLS socket to skip encryption
      and authentication operations on the transmit side of the data path,
      leaving those computationally expensive operations to the NIC.
      
      The NIC offload infrastructure builds TLS records and pushes them to
      the TCP layer just like the SW KTLS implementation and using the same
      API.
      TCP segmentation is mostly unaffected. Currently the only exception is
      that we prevent mixed SKBs where only part of the payload requires
      offload. In the future we are likely to add a similar restriction
      following a change cipher spec record.
      
      The notable differences between SW KTLS and NIC offloaded TLS
      implementations are as follows:
      1. The offloaded implementation builds "plaintext TLS records"; these
      records contain plaintext instead of ciphertext and placeholder bytes
      instead of authentication tags.
      2. The offloaded implementation maintains a mapping from TCP sequence
      number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
      TLS socket, we can use the tls NIC offload infrastructure to obtain
      enough context to encrypt the payload of the SKB.
      A TLS record is released when the last byte of the record is ack'ed;
      this is done through the new icsk_clean_acked callback.
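
      The sequence-number-to-record mapping and the release-on-ack rule
      from item 2 can be illustrated with the sketch below. It is a
      simplified user-space model under stated assumptions: struct
      tls_record, struct record_map and the map_* helpers are hypothetical
      names, a fixed array stands in for whatever list the kernel keeps,
      and map_clean_acked() only mimics the role the text assigns to the
      icsk_clean_acked callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 8

/* Hypothetical per-record metadata kept until the record is acked. */
struct tls_record {
    uint32_t end_seq; /* TCP sequence number just past the last byte */
    uint32_t len;     /* record length in bytes */
    int      in_use;  /* 1 while some byte is still unacknowledged */
};

struct record_map {
    struct tls_record rec[MAX_RECORDS]; /* kept in transmit order */
    size_t count;
};

/* Wraparound-safe "a comes before b" for 32-bit sequence numbers. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Track a record that starts at start_seq. Returns 0, or -1 if full. */
static int map_add(struct record_map *m, uint32_t start_seq, uint32_t len)
{
    assert(len > 0);
    if (m->count == MAX_RECORDS)
        return -1;
    m->rec[m->count].end_seq = start_seq + len;
    m->rec[m->count].len = len;
    m->rec[m->count].in_use = 1;
    m->count++;
    return 0;
}

/* Find the live record whose byte range contains seq, or NULL. */
static struct tls_record *map_lookup(struct record_map *m, uint32_t seq)
{
    for (size_t i = 0; i < m->count; i++) {
        struct tls_record *r = &m->rec[i];
        uint32_t start = r->end_seq - r->len;

        if (r->in_use && !seq_before(seq, start) &&
            seq_before(seq, r->end_seq))
            return r;
    }
    return NULL;
}

/* Release every record whose last byte is covered by acked_seq (the
 * next sequence number the peer expects). Returns how many were freed. */
static size_t map_clean_acked(struct record_map *m, uint32_t acked_seq)
{
    size_t released = 0;

    for (size_t i = 0; i < m->count; i++) {
        struct tls_record *r = &m->rec[i];

        if (r->in_use && !seq_before(acked_seq, r->end_seq)) {
            r->in_use = 0;
            released++;
        }
    }
    return released;
}
```

      A retransmitted SKB would call map_lookup() with its starting
      sequence number to recover the record it belongs to; a partially
      acknowledged record stays in the map because a later retransmission
      may still need its context.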
      
      The infrastructure should be extensible to support various NIC offload
      implementations. However, it is currently written with the
      implementation below in mind:
      The NIC assumes that packets from each offloaded stream are sent as
      plaintext and in-order. It keeps track of the TLS records in the TCP
      stream. When a packet marked for offload is transmitted, the NIC
      encrypts the payload in place and puts authentication tags in the
      relevant placeholders.
      
      The responsibility for handling out-of-order packets (e.g. TCP
      retransmission, qdisc drops) falls on the netdev driver.
      
      The netdev driver keeps track of the expected TCP SN from the NIC's
      perspective.  If the next packet to transmit matches the expected TCP
      SN, the driver advances the expected TCP SN, and transmits the packet
      with TLS offload indication.
      
      If the next packet to transmit does not match the expected TCP SN, the
      driver calls the TLS layer to obtain the TLS record that includes the
      TCP sequence number of the packet for transmission. Using this TLS
      record, the driver
      posts a work entry on the transmit queue to reconstruct the NIC TLS
      state required for the offload of the out-of-order packet. It updates
      the expected TCP SN accordingly and transmits the now in-order packet.
      The same queue is used for packet transmission and TLS context
      reconstruction to avoid the need for flushing the transmit queue before
      issuing the context reconstruction request.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8f69799
    • Boris Pismenny's avatar
      net/tls: Split conf to rx + tx · f66de3ee
      Boris Pismenny authored
      In TLS inline crypto, we can have one direction in software
      and another in hardware. Thus, we split the TLS configuration into
      separate structures for receive and transmit.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f66de3ee
    • Ilya Lesokhin's avatar
      net: Add TLS TX offload features · 2342a851
      Ilya Lesokhin authored
      This patch adds a netdev feature to configure TLS TX offloads.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2342a851
    • Ilya Lesokhin's avatar
      net: Add TLS offload netdev ops · a5c37c63
      Ilya Lesokhin authored
      Add new netdev ops to add and delete TLS contexts.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5c37c63
    • Ilya Lesokhin's avatar
      net: Add Software fallback infrastructure for socket dependent offloads · ebf4e808
      Ilya Lesokhin authored
      With socket dependent offloads we rely on the netdev to transform
      the transmitted packets before sending them to the wire.
      When a packet from an offloaded socket is rerouted to a different
      device we need to detect it and do the transformation in software.
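
      The detection step described above can be sketched as a check made
      just before a packet is handed to a device. This is a user-space
      illustration only: struct netdev, struct offload_sock, struct packet,
      sw_transform() and validate_xmit() are hypothetical stand-ins, not
      the kernel interfaces this patch adds, and the transformation itself
      is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* All types and names below are hypothetical stand-ins. */
struct netdev {
    int ifindex;
};

struct offload_sock {
    const struct netdev *offload_dev; /* device holding the offload state */
};

struct packet {
    const struct offload_sock *sk; /* NULL if no offloaded socket owns it */
    int sw_transformed;            /* set when the software path ran */
};

/* Placeholder for the software transformation (for TLS, this would be
 * the record encryption the device was expected to perform). */
static void sw_transform(struct packet *p)
{
    p->sw_transformed = 1;
}

/* Called before handing p to dev. Returns 1 if the software fallback
 * ran, 0 if the packet needs nothing or dev will transform it. */
static int validate_xmit(struct packet *p, const struct netdev *dev)
{
    assert(p != NULL && dev != NULL);
    if (p->sk == NULL || p->sk->offload_dev == dev)
        return 0;
    /* Rerouted: the target device has no state for this socket. */
    sw_transform(p);
    return 1;
}
```

      Comparing the socket's offload device with the transmitting device
      is the simplest possible test; the real check may involve more state
      than a single pointer comparison.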
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebf4e808
    • Ilya Lesokhin's avatar
      net: Rename and export copy_skb_header · 08303c18
      Ilya Lesokhin authored
      copy_skb_header is renamed to skb_copy_header and
      exported. Exposing this function gives more flexibility
      in copying SKBs.
      skb_copy and skb_copy_expand do not give enough control
      over which parts are copied.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08303c18
    • Ilya Lesokhin's avatar
      tcp: Add clean acked data hook · 6dac1523
      Ilya Lesokhin authored
      Add a hook that is called when a TCP segment is acknowledged.
      It could be used by application protocols that hold additional
      metadata associated with the stream data.
      
      This is required by TLS device offload to release
      metadata associated with acknowledged TLS records.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6dac1523
    • David S. Miller's avatar
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 1a1f4a28
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2018-04-30
      
      This series contains updates to i40e and i40evf only.
      
      Jia-Ju Bai replaces an instance of GFP_ATOMIC with GFP_KERNEL, since
      i40evf is not in atomic context when i40evf_add_vlan() is called.
      
      Jake cleans up function header comments to ensure that the function
      parameter comments actually match the function parameters. He also
      fixes a possible overflow error in the PTP clock code and warnings
      regarding restricted __be32 type usage.
      
      Mariusz fixes the reading of the LLDP configuration, which moves from
      using relative values to calculating the absolute address.
      
      Jakub adds a check for 10G LR mode for i40e.
      
      Paweł fixes an issue where changing the MTU would turn on TSO, GSO and
      GRO.
      
      Alex fixes a couple of issues with the UDP tunnel filter configuration.
      The first was that the tunnels did not have mutual exclusion in place to
      prevent a race condition between a user request to add/remove a port and
      an update. The second was that we were deleting filters that were not
      associated with the actual filter we wanted to delete.
      
      Harshitha ensures that the queue map sent by the VF is taken into
      account when enabling/disabling queues in the VF VSI.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a1f4a28