1. 04 Feb, 2017 1 commit
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · a076d1bd
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2017-02-03
      
      This series contains updates to i40e/i40evf only.
      
      Jake fixes up the driver to not call i40e_vsi_kill_vlan() or
      i40e_vsi_add_vlan() when the PVID is set or when the VID is less than 1.
      Removes a check that is not needed, since i40e_del_mac_all_vlan() can
      simply be called directly.  Renames functions so that their names
      better reflect their actual purpose and behavior.
      
      Bimmy cleans up unused/deprecated macros.
      
      Mitch cleans up unused device ids which were intended for use when
      running Linux VF drivers under Hyper-V, but found to be not needed.
      Then cleans up a function that is no longer needed since the client
      open and close functions were refactored.  Adds a sleep without timeout
      until the reply from the PF driver has been received, since the iWARP
      client cannot continue until the operation has completed.
      
      Tushar Dave fixes an issue seen on SPARC where the use of the 'packed'
      directive was causing kernel unaligned errors.
      
      Alex does a refactor to pull some data off of the stack and store it
      in the transmit buffer info section of the transmit ring.
      
      Alan fixes a bug which was caused by passing a bad register value to the
      firmware, by refactoring the macro INTRL_USEC_TO_REG into a static
      inline function.  Also added feedback to the user as to the actual
      interrupt rate limit being used when it differs from the requested limit.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
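The interrupt rate limit change described in the cover letter (a macro replaced by a static inline function, plus feedback about the limit actually applied) can be sketched roughly as follows. This is not the i40e code: the 4-usec granularity, the 6-bit interval field, the enable bit position and every name ending in `_model` or `_MODEL` are assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed register layout: the low 6 bits hold the interval in 4-usec
 * units and bit 6 enables rate limiting. Illustrative values only. */
#define INTRL_UNITS_MAX_MODEL 0x3Fu
#define INTRL_ENA_MODEL       (1u << 6)

/* Interval in hardware units after rounding down and clamping. */
static inline uint32_t intrl_units_model(unsigned int usecs)
{
	uint32_t units = usecs >> 2;

	return units > INTRL_UNITS_MAX_MODEL ? INTRL_UNITS_MAX_MODEL : units;
}

/* Register value: 0 disables the limit, otherwise units plus enable bit. */
static inline uint32_t intrl_usec_to_reg_model(unsigned int usecs)
{
	uint32_t units = intrl_units_model(usecs);

	return units ? (units | INTRL_ENA_MODEL) : 0;
}

/* Limit actually applied, so the caller can report it to the user. */
static inline unsigned int intrl_effective_usecs_model(unsigned int usecs)
{
	return intrl_units_model(usecs) << 2;
}
```

A caller could compare `intrl_effective_usecs_model(requested)` with `requested` and print a notice when they differ, which is the kind of feedback the cover letter mentions.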
2. 03 Feb, 2017 39 commits
    • net: skb_needs_check() accepts CHECKSUM_NONE for tx · 6e7bc478
      Eric Dumazet authored
      My recent change missed the fact that UFO performs a complete
      UDP checksum before segmenting into frags.
      
      In this case skb->ip_summed is set to CHECKSUM_NONE.
      
      We need to add this valid case to skb_needs_check().
      
      Fixes: b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
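A user-space model of the check described in this commit is sketched below. The enum values mirror the kernel's CHECKSUM_* constants; `struct skb_model` and the `_model` suffix are illustrative stand-ins, and the receive branch is an assumption about the surrounding code rather than something stated in the commit message.

```c
#include <stdbool.h>

/* Checksum states, mirroring the kernel's CHECKSUM_* values. */
enum ip_summed_model {
	CHECKSUM_NONE = 0,
	CHECKSUM_UNNECESSARY = 1,
	CHECKSUM_COMPLETE = 2,
	CHECKSUM_PARTIAL = 3,
};

/* Hypothetical stand-in for struct sk_buff: only the field we need. */
struct skb_model {
	enum ip_summed_model ip_summed;
};

/*
 * Returns true when the packet looks suspicious and deserves a warning.
 * On tx, CHECKSUM_PARTIAL is the usual state; CHECKSUM_NONE is also
 * valid because UFO has already computed the full UDP checksum.
 */
static bool skb_needs_check_model(const struct skb_model *skb, bool tx_path)
{
	if (tx_path)
		return skb->ip_summed != CHECKSUM_PARTIAL &&
		       skb->ip_summed != CHECKSUM_NONE;

	/* Assumed rx rule: only an unverified checksum needs checking. */
	return skb->ip_summed == CHECKSUM_NONE;
}
```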
    • net: remove support for per driver ndo_busy_poll() · 79e7fff4
      Eric Dumazet authored
      We added generic support for busy polling in the NAPI layer in linux-4.5.
      
      No network driver uses ndo_busy_poll() anymore, so we can get rid
      of the pointer in struct net_device_ops and its use in sk_busy_loop().
      
      This saves the NETIF_F_BUSY_POLL feature bit.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • enic: Remove local ndo_busy_poll() implementation. · 7a655c63
      David S. Miller authored
      We do polling generically these days.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbevf: get rid of custom busy polling code · 508aac6d
      Eric Dumazet authored
      In linux-4.5, busy polling was implemented in the core
      NAPI stack, meaning that all custom implementations can
      be removed from drivers.
      
      Not only do we remove lots of code, we also remove one lock
      operation in the fast path, and allow GRO to do its job.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbe: get rid of custom busy polling code · 3ffc1af5
      Eric Dumazet authored
      In linux-4.5, busy polling was implemented in the core
      NAPI stack, meaning that all custom implementations can
      be removed from drivers.
      
      Not only do we remove lots of code, we also remove one lock
      operation in the fast path, and allow GRO to do its job.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 52e01b84
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for your net-next
      tree, they are:
      
      1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
         sk_buff so we only access one single cacheline in the conntrack
         hotpath. Patchset from Florian Westphal.
      
      2) Don't leak pointers to internal structures when exporting x_tables
         ruleset back to userspace, from Willem de Bruijn. This includes new
         helper functions to copy data to userspace such as xt_data_to_user()
         as well as conversions of our ip_tables, ip6_tables and arp_tables
         clients to use it. Not surprisingly, ebtables requires an ad-hoc
         update. There is also a new field in x_tables extensions to indicate
         the amount of bytes that we copy to userspace.
      
      3) Add nf_log_all_netns sysctl: This new knob allows you to enable
         logging via nf_log infrastructure for all existing netnamespaces.
         Given the effort to provide pernet syslog has been discontinued,
         let's provide a way to restore logging using netfilter kernel logging
         facilities in trusted environments. Patch from Michal Kubecek.
      
      4) Validate SCTP checksum from conntrack helper, from Davide Caratti.
      
      5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
         a copy&paste from the original helper, from Florian Westphal.
      
      6) Reset netfilter state when duplicating packets, also from Florian.
      
      7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
         nft_meta, from Liping Zhang.
      
      8) Add missing code to deal with loopback packets from nft_meta when
         used by the netdev family, also from Liping.
      
      9) Several cleanups on nf_tables: one removes an unnecessary check from
         the netlink control plane path that adds tables, sets and stateful
         objects, another consolidates code when unregistering chain hooks,
         from Gao Feng.
      
      10) Fix harmless reference counter underflow in IPVS that, however,
          results in problems with the introduction of the new refcount_t
          type, from David Windsor.
      
      11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
          from Davide Caratti.
      
      12) Missing documentation on nf_tables uapi header, from Liping Zhang.
      
      13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
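Item 1 of the pull request (stashing the 3-bit ctinfo into the pointer to the conntrack object) relies on pointer tagging: an object aligned to 8 bytes always has its low three address bits clear, so they can carry a small value. The sketch below shows the general technique in user space; `struct conn_model`, the `_MODEL` masks and the function names are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define NFCT_INFOMASK_MODEL ((uintptr_t)7)          /* low 3 bits: ctinfo   */
#define NFCT_PTRMASK_MODEL  (~NFCT_INFOMASK_MODEL)  /* other bits: pointer  */

/* Hypothetical conntrack object; 8-byte alignment keeps the low bits zero. */
struct conn_model {
	_Alignas(8) int id;
};

/* Pack an aligned pointer and a 3-bit state into a single word. */
static uintptr_t nfct_pack(const struct conn_model *ct, unsigned int ctinfo)
{
	assert(((uintptr_t)ct & NFCT_INFOMASK_MODEL) == 0);
	assert(ctinfo <= NFCT_INFOMASK_MODEL);
	return (uintptr_t)ct | ctinfo;
}

/* Recover the object pointer by masking off the state bits. */
static struct conn_model *nfct_ptr(uintptr_t word)
{
	return (struct conn_model *)(word & NFCT_PTRMASK_MODEL);
}

/* Recover the 3-bit state. */
static unsigned int nfct_info(uintptr_t word)
{
	return (unsigned int)(word & NFCT_INFOMASK_MODEL);
}
```

Both values then travel in one word, so reading the state does not require touching a second field.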
    • Merge branch 'mlxsw-Introduce-TC-Flower-offload-using-TCAM' · e60df624
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      mlxsw: Introduce TC Flower offload using TCAM
      
      This patchset introduces support for offloading TC cls_flower and actions
      to the Spectrum TCAM-based policy engine.
      
      The patchset contains patches to allow work with flexible keys and actions
      which are used in Spectrum TCAM.
      
      It also contains in-driver infrastructure for offloading TC rules to TCAM HW.
      The TCAM management code is simple and limited for now. It is going to be
      extended as a follow-up work.
      
      The last patch uses the previously introduced infra to implement
      cls_flower offloading. Initially, only a limited set of match keys and
      only the drop and forward actions are supported.
      
      As a dependency, this patchset introduces parman - priority array
      area manager - as a library.
      
      v1->v2:
      - patch11:
        - use __set_bit and __test_and_clear_bit as suggested by DaveM
      - patch16:
        - Added documentation to the API functions as suggested by Tom Herbert
      - patch17:
        - use __set_bit and __clear_bit as suggested by DaveM
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Implement TC flower offload · 7aa0f5aa
      Jiri Pirko authored
      Extend the existing setup_tc ndo call to allow offloading cls_flower
      rules. Only a limited set of dissector keys and actions is supported
      for now. Use the previously introduced ACL infrastructure to offload
      cls_flower rules to be processed in the HW.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sched: cls_flower: expose priority to offloading netdevice · 69ca05ce
      Jiri Pirko authored
      The driver that offloads flower rules needs to know with which priority
      the user inserted the rules. So add this information to the offload struct.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Introduce ACL core with simple TCAM implementation · 22a67766
      Jiri Pirko authored
      Add ACL core infrastructure for Spectrum ASIC. This infra provides an
      abstraction layer over specific HW implementations. There are two basic
      objects used. One is "rule" and the second is "ruleset" which serves as a
      container of multiple rules. In general, within one ruleset the rules are
      allowed to have multiple priorities and masks. Each ruleset is bound to
      either ingress or egress of a port netdevice.
      
      The initial TCAM implementation is very simple and limited. It utilizes
      parman lsort manager to take care of TCAM region layout.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lib: Introduce priority array area manager · 44091d29
      Jiri Pirko authored
      This introduces an infrastructure for management of linear priority
      areas. Priority order in an array matters; however, the order of items
      inside a priority group does not matter.
      
      As an initial implementation, the L-sort algorithm is used. It is quite
      trivial. A more advanced algorithm called P-sort will be introduced as a
      follow-up. The infrastructure is prepared for other algos.
      
      Alongside this, a testing module is introduced as well.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
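Because order inside a priority group does not matter, making room for a new item does not require shifting every later item. A minimal sketch of that idea follows: for each later group, only its first item is moved to the free slot just past the group. This is a simplified user-space illustration, not the parman implementation; the fixed-capacity array and all names are assumptions.

```c
#include <stddef.h>

#define PRIO_ARRAY_CAP 32

/* Hypothetical item: a priority plus an opaque id. */
struct prio_item {
	int prio;
	int id;
};

struct prio_array {
	struct prio_item slot[PRIO_ARRAY_CAP];
	size_t count;
	size_t moves;   /* how many existing items were relocated */
};

/*
 * Insert so that slots stay sorted by ascending priority value.
 * Instead of shifting every later item, move only the first item of
 * each later priority group to the free slot just past that group.
 * Returns 0 on success, -1 when the array is full.
 */
static int prio_array_add(struct prio_array *pa, int prio, int id)
{
	size_t free_slot = pa->count;

	if (pa->count == PRIO_ARRAY_CAP)
		return -1;

	while (free_slot > 0 && pa->slot[free_slot - 1].prio > prio) {
		size_t start = free_slot - 1;

		/* Find the first slot of the group that ends before free_slot. */
		while (start > 0 &&
		       pa->slot[start - 1].prio == pa->slot[free_slot - 1].prio)
			start--;

		pa->slot[free_slot] = pa->slot[start];
		pa->moves++;
		free_slot = start;   /* the group's old first slot is now free */
	}

	pa->slot[free_slot].prio = prio;
	pa->slot[free_slot].id = id;
	pa->count++;
	return 0;
}
```

The cost of an insert is one move per later priority group rather than one move per later item, which matters when each move is a hardware TCAM write.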
    • list: introduce list_for_each_entry_from_reverse helper · b862815c
      Jiri Pirko authored
      Similar to list_for_each_entry_continue and its reverse variant
      list_for_each_entry_continue_reverse, introduce reverse helper for
      list_for_each_entry_from.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
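The helper iterates backwards starting from the current position (inclusive) rather than from the list tail. The sketch below reproduces the idea in user space with minimal stand-ins for the kernel list primitives; it relies on the GCC/Clang `__typeof__` extension, and `struct num_node` and `sum_from_reverse()` are hypothetical names used only for the illustration.

```c
#include <stddef.h>

/* Minimal user-space stand-ins for the kernel's list primitives. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_prev_entry(pos, member) \
	container_of((pos)->member.prev, __typeof__(*(pos)), member)

/* Walk backwards starting at the current value of pos (inclusive). */
#define list_for_each_entry_from_reverse(pos, head, member) \
	for (; &(pos)->member != (head); pos = list_prev_entry(pos, member))

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical payload type used only for this illustration. */
struct num_node {
	int value;
	struct list_head link;
};

/* Sum the values from start back to the beginning of the list. */
static int sum_from_reverse(struct num_node *start, struct list_head *head)
{
	struct num_node *pos = start;
	int sum = 0;

	list_for_each_entry_from_reverse(pos, head, link)
		sum += pos->value;
	return sum;
}
```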
    • mlxsw: resources: Add ACL related resources · 8708ecf0
      Jiri Pirko authored
      Add a couple of resource limits related to ACL.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Introduce basic set of flexible key blocks · b876b9aa
      Jiri Pirko authored
      Introduce a basic set of Spectrum flexible key blocks. It contains the
      blocks needed to carry all elements defined so far.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Introduce flexible actions support · 4cda7d8d
      Jiri Pirko authored
      Each entry which is matched during ACL lookup points to an action set.
      This action set contains up to three separate actions. If more actions
      need to be chained, an extended set is created to hold them
      in the KVD linear area.
      
      This patch implements handling of sets and encoding of actions.
      Currently, only two actions are supported: drop and forward. Forward
      action uses PBS pointer to KVD linear area, so the action code needs to
      take care of this as well.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
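The chaining rule described above (three actions per set, with extended sets created when more are needed) can be modelled as a linked chain of fixed-size sets. The sketch below covers only that bookkeeping: it uses heap allocation where the driver would use the KVD linear area, does no hardware encoding, and all type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define ACTIONS_PER_SET 3

enum action_model { ACTION_DROP, ACTION_FORWARD };

/* One action set; "next" points at the extended set, if any. */
struct action_set {
	enum action_model action[ACTIONS_PER_SET];
	size_t count;
	struct action_set *next;
};

/* Append an action, creating an extended set when the last one is full.
 * Returns 0 on success, -1 on allocation failure. */
static int action_chain_append(struct action_set *first, enum action_model act)
{
	struct action_set *set = first;

	while (set->next)
		set = set->next;

	if (set->count == ACTIONS_PER_SET) {
		set->next = calloc(1, sizeof(*set->next));
		if (!set->next)
			return -1;
		set = set->next;
	}
	set->action[set->count++] = act;
	return 0;
}

/* Number of sets in the chain, including the first one. */
static size_t action_chain_sets(const struct action_set *first)
{
	size_t n = 0;

	for (; first; first = first->next)
		n++;
	return n;
}

/* Free the extended sets; the first set is owned by the caller. */
static void action_chain_free(struct action_set *first)
{
	struct action_set *set = first->next;

	while (set) {
		struct action_set *next = set->next;

		free(set);
		set = next;
	}
	first->next = NULL;
}
```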
    • mlxsw: core: Introduce flexible keys support · 3f1a84e6
      Jiri Pirko authored
      Hardware supports matching on so-called "flexible keys". The idea is to
      assemble an optimal key to use for matching according to the fields in
      the packet (elements) requested by the user. Certain sets of elements are
      combined into pre-defined blocks. There is a picker to find the needed
      blocks. Keys consist of 1..n blocks.
      
      Alongside that, an initial portion of elements is introduced in order
      to be able to offload basic cls_flower rules.
      
      Picked keys are cached so multiple rules could share them.
      
      There is an encode function provided that takes care of encoding key and
      mask values according to given key.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
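One plausible way to implement a block picker is a greedy cover: treat each block as a bitmask of the elements it carries and repeatedly take the block that covers the most still-missing elements. The sketch below assumes that strategy; the element names, the bitmask encoding and `key_pick_blocks()` are hypothetical and are not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet elements, one bit each. */
enum {
	ELEM_DMAC      = 1u << 0,
	ELEM_SMAC      = 1u << 1,
	ELEM_ETHERTYPE = 1u << 2,
	ELEM_SRC_IP    = 1u << 3,
	ELEM_DST_IP    = 1u << 4,
	ELEM_IP_PROTO  = 1u << 5,
};

/* Count set bits by clearing the lowest one until none remain. */
static int popcount32(uint32_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}
	return n;
}

/*
 * Greedy picker: repeatedly take the block that covers the most
 * still-missing elements. Writes the chosen block indexes to picked[]
 * and returns how many were chosen, or -1 if the wanted elements cannot
 * be covered within max_picked blocks.
 */
static int key_pick_blocks(const uint32_t *blocks, size_t nblocks,
			   uint32_t wanted, size_t *picked, size_t max_picked)
{
	size_t npicked = 0;

	while (wanted) {
		size_t best = 0;
		int best_hits = 0;

		for (size_t i = 0; i < nblocks; i++) {
			int hits = popcount32(blocks[i] & wanted);

			if (hits > best_hits) {
				best_hits = hits;
				best = i;
			}
		}
		if (best_hits == 0 || npicked == max_picked)
			return -1;

		picked[npicked++] = best;
		wanted &= ~blocks[best];
	}
	return (int)npicked;
}
```

Caching the resulting block list per requested element set would let several rules share one key, as the commit message describes.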
    • mlxsw: reg: Add Policy-Engine Extended Flexible Action Register · e3426e12
      Jiri Pirko authored
      PEFA register is used for accessing an extended flexible action entry
      in the central KVD Linear Database.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Policy-Engine Policy Based Switching Register · d120649d
      Jiri Pirko authored
      The PPBS register retrieves and sets Policy Based Switching Table entries.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Policy-Engine Rules Copy Register · 937b682c
      Jiri Pirko authored
      The PRCR register is used for accessing rules within a TCAM region.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Policy-Engine Port Binding Table · af7170ee
      Jiri Pirko authored
      The PPBT is used for configuration of the Port Binding Table.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Policy-Engine TCAM Entry Register Version 2 · 0171cdec
      Jiri Pirko authored
      The PTCE-V2 register is used for accessing rules within a TCAM region.
      It is a new version of PTCE in order to support wider key, mask and
      action within a TCAM region.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Policy-Engine TCAM Allocation Register · d9c2661e
      Jiri Pirko authored
      The PTAR register is used for allocation of regions in the TCAM.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Policy-Engine ACL Group Table register · 10fabef5
      Jiri Pirko authored
      The PAGT register is used for configuration of the ACL Group Table.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Policy-Engine ACL Register · 3279da4c
      Jiri Pirko authored
      The PACL register is used for configuration of the ACL.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: item: Add helpers for getting pointer into payload for char buffer item · d5e556c6
      Jiri Pirko authored
      Sometimes it is handy to get a pointer to a char buffer item and use it
      directly to write/read data. So add these helpers.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: item: Add 8bit item helpers · 2946fde9
      Jiri Pirko authored
      Item helpers for 8bit values are needed, so let's add them.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
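An 8-bit item helper reads or writes a field of at most eight bits located at a given byte offset and bit shift inside a register payload. The following sketch shows that general pattern; `struct item8` and the `item8_*` functions are hypothetical and do not reproduce the mlxsw item API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a field of at most 8 bits. */
struct item8 {
	size_t offset;        /* byte offset into the buffer */
	unsigned int shift;   /* bit position of the field's LSB in that byte */
	unsigned int bits;    /* field width, 1..8 */
};

/* Mask selecting the field's bits within its byte. */
static uint8_t item8_mask(const struct item8 *it)
{
	return (uint8_t)(((1u << it->bits) - 1u) << it->shift);
}

/* Read the field value, right-aligned. */
static uint8_t item8_get(const char *buf, const struct item8 *it)
{
	uint8_t byte = (uint8_t)buf[it->offset];

	return (uint8_t)((byte & item8_mask(it)) >> it->shift);
}

/* Write the field value without disturbing neighbouring bits. */
static void item8_set(char *buf, const struct item8 *it, uint8_t val)
{
	uint8_t mask = item8_mask(it);
	uint8_t byte = (uint8_t)buf[it->offset];

	byte = (uint8_t)((byte & ~mask) | (((unsigned int)val << it->shift) & mask));
	buf[it->offset] = (char)byte;
}
```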
    • bonding: Remove unnecessary returned value check · 3d67576d
      Zhu Yanjun authored
      The function bond_info_query always returns 0. As such, in the function
      bond_do_ioctl, it is not necessary to check the returned value. So the
      return type of the function bond_info_query is changed to void and the
      redundant check is removed.
      Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: clear pfmemalloc on outgoing skb · 38ab52e8
      Eric Dumazet authored
      Josef Bacik diagnosed the following problem:
      
         I was seeing random disconnects while testing NBD over loopback.
         This turned out to be because NBD sets pfmemalloc on its socket,
         however the receiving side is a user space application so does not
         have pfmemalloc set on its socket. This means that
         sk_filter_trim_cap will simply drop this packet, under the
         assumption that the other side will simply retransmit. Well we do
         retransmit, and then the packet is just dropped again for the same
         reason.
      
      It seems the better way to address this problem is to clear pfmemalloc
      in the TCP transmit path. pfmemalloc strict control really makes sense
      on the receive path.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
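The failure mode and the fix can be reduced to a very small model: the receive side drops any packet flagged pfmemalloc unless the receiving socket is itself allowed to use memory reserves, so a packet that keeps its flag across loopback is dropped on every retransmit. The types and function names below are hypothetical stand-ins, not kernel code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the relevant skb and socket state. */
struct pkt_model {
	bool pfmemalloc;   /* buffer was allocated from emergency reserves */
};

struct sock_model {
	bool memalloc;     /* socket may consume reserve-backed packets */
};

/* Receive-side rule: a pfmemalloc packet is only accepted by a socket
 * that is itself marked as a memalloc socket. */
static bool rx_accepts(const struct sock_model *sk, const struct pkt_model *pkt)
{
	return !pkt->pfmemalloc || sk->memalloc;
}

/* Transmit-side fix: the flag only matters on receive, so clear it
 * before the packet leaves the sender. */
static void tx_prepare(struct pkt_model *pkt)
{
	pkt->pfmemalloc = false;
}
```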
    • cxgb4: get rid of custom busy poll code · 5226b791
      Eric Dumazet authored
      In linux-4.5, busy polling was implemented in the core
      NAPI stack, meaning that all custom implementations can
      be removed from drivers.
      
      Not only do we remove a lot of code, we also remove one spin_lock()
      from the driver fast path.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ganesh Goudar <ganeshgr@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • myri10ge: get rid of custom busy poll code · 362108b5
      Eric Dumazet authored
      Compared to custom busy_poll, the generic NAPI one is simpler and
      removes a lot of code. It removes one atomic in the fast path (when
      busy poll is not in action) since we do not have to use an extra
      spinlock.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: get rid of custom busy poll code · fb6113e6
      Eric Dumazet authored
      Compared to custom busy_poll, the generic NAPI one is better, since
      it allows the use of GRO, and it removes a lot of code and extra locked
      operations in the fast path.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Sathya Perla <sathya.perla@broadcom.com>
      Cc: Ajit Khaparde <ajit.khaparde@broadcom.com>
      Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: Set protocol to kernel for local routes · 94b5e0f9
      David Ahern authored
      IPv6 stack does not set the protocol for local routes, so those routes show
      up with proto "none":
          $ ip -6 ro ls table local
          local ::1 dev lo proto none metric 0  pref medium
          local 2100:3:: dev lo proto none metric 0  pref medium
          local 2100:3::4 dev lo proto none metric 0  pref medium
          local fe80:: dev lo proto none metric 0  pref medium
          ...
      
      Set rt6i_protocol to RTPROT_KERNEL for consistency with IPv4. Now routes
      show up with proto "kernel":
          $ ip -6 ro ls table local
          local ::1 dev lo proto kernel metric 0  pref medium
          local 2100:3:: dev lo proto kernel metric 0  pref medium
          local 2100:3::4 dev lo proto kernel metric 0  pref medium
          local fe80:: dev lo proto kernel metric 0  pref medium
          ...
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • trace: rename trace_print_hex_seq arg and add kdoc · 3898fac1
      Daniel Borkmann authored
      Steven suggested improving trace_print_hex_seq() a bit after commit
      2acae0d5 ("trace: add variant without spacing in trace_print_hex_seq")
      in two ways: i) by adding a kdoc comment for the helper function
      itself and ii) by renaming the 'spacing' argument to 'concatenate'
      to better denote that we don't add spaces between hex bytes.
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
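The meaning of the renamed argument can be shown with a user-space formatter: with `concatenate` set the hex bytes are written back to back, otherwise a single space separates them. `hex_seq_format()` is a hypothetical stand-in that writes into a plain character buffer; it does not use the kernel's trace_seq API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format buf as lowercase hex into out. With concatenate set, bytes are
 * written back to back; otherwise a single space separates them.
 * Returns the number of characters written, or -1 if out is too small.
 */
static int hex_seq_format(char *out, size_t out_len,
			  const unsigned char *buf, size_t buf_len,
			  bool concatenate)
{
	size_t used = 0;

	if (out_len == 0)
		return -1;
	out[0] = '\0';

	for (size_t i = 0; i < buf_len; i++) {
		const char *sep = (concatenate || i == 0) ? "" : " ";
		int n = snprintf(out + used, out_len - used, "%s%2.2x",
				 sep, (unsigned int)buf[i]);

		if (n < 0 || (size_t)n >= out_len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```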
    • MAINTAINERS: add Ivan as a switchdev maintainer · f38c5ad7
      Jiri Pirko authored
      Ivan will be taking care of switchdev code from now on.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Ivan Vecera <ivecera@redhat.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-per-vlan-dst_metadata-support' · 3b19860c
      David S. Miller authored
      Roopa Prabhu says:
      
      ====================
      bridge: per vlan dst_metadata support
      
      High level summary:
      lwt and dst_metadata have enabled vxlan l3 deployments
      to use a single vxlan netdev for multiple vnis eliminating the scalability
      problem with using a single vxlan netdev per vni. This series tries to
      do the same for vxlan netdevs in pure l2 bridged networks.
      Use-case/deployment and details are below.
      
      Deployment scenario details:
      As we know VXLAN is used to build layer 2 virtual networks across the
      underlay layer3 infrastructure. A VXLAN tunnel endpoint (VTEP)
      originates and terminates VXLAN tunnels. And a VTEP can be a TOR switch
      or a vswitch in the hypervisor. This patch series mainly
      focuses on the TOR switch configured as a Vtep. Vxlan segment ID (vni)
      along with vlan id is used to identify layer 2 segments in a vxlan
      overlay network. Vxlan bridging is the function provided by Vteps to terminate
      vxlan tunnels and map the vxlan vni to traditional end host vlan. This is
      covered in the "VXLAN Deployment Scenarios" in sections 6 and 6.1 in RFC 7348.
      To provide vxlan bridging function, a vtep has to map vlan to a vni. The rfc
      says that the ingress VTEP device shall remove the IEEE 802.1Q VLAN tag in
      the original Layer 2 packet if there is one before encapsulating the packet
      into the VXLAN format to transmit it through the underlay network. The remote
      VTEP devices have information about the VLAN in which the packet will be
      placed based on their own VLAN-to-VXLAN VNI mapping configurations.
      
      Existing solution:
      Without this patch series one can deploy such a vtep configuration by
      adding the local ports and vxlan netdevs into a vlan filtering bridge.
      The local ports are configured as trunk ports carrying all vlans.
      A vxlan netdev per vni is added to the bridge. Vlan mapping to vni is
      achieved by configuring the vlan as pvid on the corresponding vxlan netdev.
      The vxlan netdev only receives traffic corresponding to the vlan it is mapped
      to. This configuration maps traffic belonging to a vlan to the corresponding
      vxlan segment.
      
                -----------------------------------
               |              bridge               |
               |                                   |
                -----------------------------------
                  |100,200       |100 (pvid)    |200 (pvid)
                  |              |              |
                 swp1          vxlan1000      vxlan2000
      
      This provides the required vxlan bridging function but poses a
      scalability problem with using a separate vxlan netdev for each vni.
      
      Solution in this patch series:
      The goal is to use a single vxlan device to carry all vnis similar
      to the vxlan collect metadata mode but additionally allowing the bridge
      and vxlan driver to carry all the forwarding information and also learn.
      This implementation uses the existing dst_metadata infrastructure to map
      vlan to a tunnel id.
      - vxlan driver changes:
          - enable collect metadata mode to be used with learning,
            replication and fdb
          - A single fdb table hashed by (mac, vni)
          - rx path already has the vni
          - tx path expects a vni in the packet with dst_metadata and relies
            on learnt or static forwarding information table to forward the packet
      
      - Bridge driver changes: per vlan dst_metadata support:
          - Our use case is vxlan and 1-1 mapping between vlan and vni, but I have
            kept the api generic for any tunnel info
          - Uapi to configure/unconfigure/dump per vlan tunnel data
          - new bridge port flag to turn this feature on/off. off by default
          - ingress hook:
              - if port is a tunnel port, use tunnel info in
                attached dst_metadata to map it to a local vlan
          - egress hook:
              - if port is a tunnel port, use tunnel info attached to vlan
                to set dst_metadata on the skb
      
      Other approaches tried and vetoed:
      - tc vlan push/pop and tunnel metadata dst:
          - though tc can be used to do part of this, these patches address a deployment
            case where bridge driver vlan filtering and forwarding information
            database along with vxlan driver forwarding information table and learning
            are required.
      - making vxlan driver understand vlan-vni mapping:
          - I had a series almost ready with this one but soon realized
            it duplicated a lot of vlan handling code in the vxlan driver
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
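The "single fdb table hashed by (mac, vni)" point means the same MAC address can be learnt independently on several vnis within one table. A small open-addressing table keyed on both fields illustrates this; the FNV-1a hash, the fixed bucket count, the absence of deletion and all names are simplifying assumptions and say nothing about how the vxlan driver implements its table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FDB_BUCKETS    64
#define ETH_ALEN_MODEL 6

/* Hypothetical forwarding entry keyed by (mac, vni). */
struct fdb_entry {
	uint8_t mac[ETH_ALEN_MODEL];
	uint32_t vni;
	uint32_t remote_ip;
	int in_use;
};

struct fdb_table {
	struct fdb_entry bucket[FDB_BUCKETS];
};

/* FNV-1a over the MAC bytes followed by the vni bytes. */
static uint32_t fdb_hash(const uint8_t *mac, uint32_t vni)
{
	uint32_t h = 2166136261u;

	for (int i = 0; i < ETH_ALEN_MODEL; i++) {
		h ^= mac[i];
		h *= 16777619u;
	}
	for (int i = 0; i < 4; i++) {
		h ^= (vni >> (8 * i)) & 0xffu;
		h *= 16777619u;
	}
	return h % FDB_BUCKETS;
}

/* Look up (mac, vni) with linear probing; optionally create the entry.
 * Returns NULL when not found (create == 0) or when the table is full. */
static struct fdb_entry *fdb_find(struct fdb_table *t, const uint8_t *mac,
				  uint32_t vni, int create)
{
	uint32_t start = fdb_hash(mac, vni);

	for (uint32_t n = 0; n < FDB_BUCKETS; n++) {
		struct fdb_entry *e = &t->bucket[(start + n) % FDB_BUCKETS];

		if (!e->in_use) {
			if (!create)
				return NULL;
			memcpy(e->mac, mac, ETH_ALEN_MODEL);
			e->vni = vni;
			e->in_use = 1;
			return e;
		}
		if (e->vni == vni && memcmp(e->mac, mac, ETH_ALEN_MODEL) == 0)
			return e;
	}
	return NULL;
}
```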
    • bridge: vlan dst_metadata hooks in ingress and egress paths · 11538d03
      Roopa Prabhu authored
      - ingress hook:
          - if port is a tunnel port, use tunnel info in
            attached dst_metadata to map it to a local vlan
      - egress hook:
          - if port is a tunnel port, use tunnel info attached to
            vlan to set dst_metadata on the skb
      
      CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
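The two hooks amount to a translation in each direction: on ingress from a tunnel port the tunnel id carried in the metadata selects a local vlan, and on egress to a tunnel port the vlan selects the tunnel id to attach. The sketch below models only that mapping. The structures, the linear scan on ingress, the clearing of the metadata after translation and the return convention (true when a translation happened) are all assumptions for illustration, not the bridge driver's behavior.

```c
#include <stdbool.h>
#include <stdint.h>

#define VLAN_N_VID_MODEL 4096

/* Hypothetical per-port table: vlan id -> tunnel id (e.g. a VXLAN vni). */
struct vlan_tunnel_map {
	bool mapped[VLAN_N_VID_MODEL];
	uint32_t tunnel_id[VLAN_N_VID_MODEL];
};

struct port_model {
	bool vlan_tunnel;              /* per-port feature flag, off by default */
	struct vlan_tunnel_map *map;
};

/* Hypothetical frame: a vlan id plus optional tunnel metadata. */
struct frame_model {
	uint16_t vid;
	bool has_tunnel_id;
	uint32_t tunnel_id;
};

static void map_set(struct vlan_tunnel_map *m, uint16_t vid, uint32_t tunnel_id)
{
	if (vid >= VLAN_N_VID_MODEL)
		return;
	m->mapped[vid] = true;
	m->tunnel_id[vid] = tunnel_id;
}

/* Ingress: map tunnel metadata to a local vlan. True if translated. */
static bool ingress_hook(const struct port_model *port, struct frame_model *f)
{
	if (!port->vlan_tunnel || !f->has_tunnel_id)
		return false;

	/* A linear scan keeps the sketch short; a real table would be
	 * indexed by tunnel id. */
	for (unsigned int vid = 1; vid < VLAN_N_VID_MODEL; vid++) {
		if (port->map->mapped[vid] &&
		    port->map->tunnel_id[vid] == f->tunnel_id) {
			f->vid = (uint16_t)vid;
			f->has_tunnel_id = false;
			return true;
		}
	}
	return false;
}

/* Egress: attach the tunnel id configured for the frame's vlan. */
static bool egress_hook(const struct port_model *port, struct frame_model *f)
{
	if (!port->vlan_tunnel || f->vid >= VLAN_N_VID_MODEL ||
	    !port->map->mapped[f->vid])
		return false;

	f->tunnel_id = port->map->tunnel_id[f->vid];
	f->has_tunnel_id = true;
	return true;
}
```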
    • bridge: per vlan dst_metadata netlink support · efa5356b
      Roopa Prabhu authored
      This patch adds support to attach per vlan tunnel info dst
      metadata. This enables bridge driver to map vlan to tunnel_info
      at ingress and egress. It uses the kernel dst_metadata infrastructure.
      
      The initial use case is vlan to vni bridging, but the api is generic
      to extend to any tunnel_info in the future:
          - Uapi to configure/unconfigure/dump per vlan tunnel data
          - netlink functions to configure vlan and tunnel_info mapping
          - Introduces bridge port flag BR_LWT_VLAN to enable attaching/detaching
            dst_metadata to bridged packets on ports. Off by default.
          - changes to existing code are mainly a refactor of some existing
            vlan handling netlink code plus hooks for the new vlan tunnel code
          - I have kept the vlan tunnel code isolated in separate files.
          - most of the netlink vlan tunnel code handles vlan-tunid ranges
            (it follows the vlan range handling code). To conserve space,
            vlan-tunid mappings are by default dumped in ranges where applicable.
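
      The range dumping mentioned in the last item can be sketched as
      follows. This is a rough illustration and not the netlink code from
      the patch: compress_ranges is a hypothetical helper, and it assumes
      that a range is only extended when both the vlan id and the tunnel id
      grow by exactly one.

```python
from typing import Dict, List, Tuple

# (vlan_start, vlan_end, tunid_start, tunid_end)
Range = Tuple[int, int, int, int]


def compress_ranges(mapping: Dict[int, int]) -> List[Range]:
    """Collapse a vlan -> tunid mapping into contiguous ranges."""
    ranges: List[Range] = []
    for vlan in sorted(mapping):
        tunid = mapping[vlan]
        if ranges:
            v_start, v_end, t_start, t_end = ranges[-1]
            # Extend the current range only if both ids are consecutive.
            if vlan == v_end + 1 and tunid == t_end + 1:
                ranges[-1] = (v_start, vlan, t_start, tunid)
                continue
        ranges.append((vlan, vlan, tunid, tunid))
    return ranges


# vlans 100-102 map to tunnel ids 1000-1002 and dump as a single range;
# vlan 200 stands alone.
dumped = compress_ranges({100: 1000, 101: 1001, 102: 1002, 200: 5000})
```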
      
      Use case:
      example use for this is a vxlan bridging gateway or vtep
      which maps vlans to vn-segments (or vnis).
      
      iproute2 example (patched and pruned iproute2 output to just show
      relevant fdb entries):
      example shows same host mac learnt on two vni's and
      vlan 100 maps to vni 1000, vlan 101 maps to vni 1001
      
      before (netdev per vni):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan1001 vlan 101 master bridge
      00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan1000 vlan 100 master bridge
      00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self
      
      after this patch with collect metadata in bridged mode (single netdev):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan0 vlan 101 master bridge
      00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan0 vlan 100 master bridge
      00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self
      
      CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      efa5356b
    • bridge: uapi: add per vlan tunnel info · b3c7ef0a
      Roopa Prabhu authored
      New nested netlink attribute to associate tunnel info per vlan.
      This is used by bridge driver to send tunnel metadata to
      bridge ports in vlan tunnel mode. This patch also adds a new per
      port flag IFLA_BRPORT_VLAN_TUNNEL to enable vlan tunnel mode,
      which is off by default.
      
      One example use for this is a vxlan bridging gateway or vtep
      which maps vlans to vn-segments (or vnis). User can configure
      per-vlan tunnel information which the bridge driver can use
      to bridge vlan into the corresponding vn-segment.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b3c7ef0a
    • vxlan: support fdb and learning in COLLECT_METADATA mode · 3ad7a4b1
      Roopa Prabhu authored
      Vxlan COLLECT_METADATA mode today solves the per-vni netdev
      scalability problem in l3 networks. It expects all forwarding
      information to be present in dst_metadata. This patch series
      enhances collect metadata mode to include the case where only
      vni is present in dst_metadata, and the vxlan driver can then use
      the rest of the forwarding information database to make forwarding
      decisions. There is no change to default COLLECT_METADATA
      behaviour. These changes only apply to COLLECT_METADATA when
      used with the bridging use-case with a special dst_metadata
      tunnel info flag (eg: where vxlan device is part of a bridge).
      For all this to work, the vxlan driver now needs to support a
      single fdb table hashed by mac + vni, which is what this series
      implements.
      
      use-case and workflow:
      A vxlan collect metadata device participates in bridging vlans
      to vn-segments. The bridge driver above the vxlan device
      sends the vni corresponding to the vlan in the dst_metadata.
      The vxlan driver then looks up the forwarding database with
      (mac + vni) for the remote destination information required to
      forward the packet.
      
      Changes introduced by this patch:
          - allow learning and forwarding database state in vxlan netdev in
            COLLECT_METADATA mode. Current behaviour is not changed
            by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
            to support the new bridge friendly mode.
          - A single fdb table hashed by (mac, vni) to allow fdb entries with
            multiple vnis in the same fdb table
          - rx path already has the vni
          - tx path expects a vni in the packet with dst_metadata
          - prior to this series, fdb remote_dsts carried remote vni and
            the vxlan device carrying the fdb table represented the
            source vni. With the vxlan device now representing multiple vnis,
            this patch adds a src vni attribute to the fdb entry. The remote
            vni already uses NDA_VNI attribute. This patch introduces
            NDA_SRC_VNI netlink attribute to represent the src vni in a multi
            vni fdb table.
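
      A single table keyed by (mac, vni) can be modelled in a few lines.
      This is a minimal sketch only, not the vxlan driver's data structure:
      MultiVniFdb, learn and lookup are hypothetical names, and a Python
      dict stands in for the kernel hash table. The mac, vnis and remote
      address are taken from the iproute2 example in this message.

```python
from typing import Dict, Optional, Tuple


class MultiVniFdb:
    """Hypothetical fdb shared by all vnis of one vxlan device."""

    def __init__(self) -> None:
        # Key is (mac, src vni); value is the remote destination ip.
        self._entries: Dict[Tuple[str, int], str] = {}

    def learn(self, mac: str, src_vni: int, remote_ip: str) -> None:
        self._entries[(mac.lower(), src_vni)] = remote_ip

    def lookup(self, mac: str, src_vni: int) -> Optional[str]:
        # The same mac may appear under several vnis, so the vni from
        # dst_metadata is part of the lookup key.
        return self._entries.get((mac.lower(), src_vni))


fdb = MultiVniFdb()
fdb.learn("00:02:00:00:00:03", 1000, "12.0.0.8")
fdb.learn("00:02:00:00:00:03", 1001, "12.0.0.8")
remote = fdb.lookup("00:02:00:00:00:03", 1001)
missing = fdb.lookup("00:02:00:00:00:03", 2000)  # no entry for this vni
```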
      
      iproute2 example (patched and pruned iproute2 output to just show
      relevant fdb entries):
      example shows same host mac learnt on two vni's.
      
      before (netdev per vni):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self
      
      after this patch with collect metadata in bridged mode (single netdev):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ad7a4b1