1. 09 Oct, 2017 16 commits
    • Emil Tantilov's avatar
      ixgbe: Clear SWFW_SYNC register during init · 2e22a75c
      Emil Tantilov authored
      Added clearing of SW resource bits in the SW/FW synchronization
      register to ixgbe_init_swfw_sync_X540().
      
      Updated the SW Manageability host interface resource bit error case in
      ixgbe_acquire_swfw_sync_X540() to match the error handling of the other
      SW resource bits, which is to release the SW resource bits if SW times
      out while attempting to acquire the resource.
      
      This allows the driver to load in cases where the semaphore bits
      could be stuck after a reset or a crash.
      Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • Christos Gkekas's avatar
      qed: Delete redundant check on dcb_app priority · c49c777f
      Christos Gkekas authored
      dcb_app priority is unsigned thus checking whether it is less than zero
      is redundant.
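
To illustrate why such a check is dead code, here is a minimal standalone sketch. The function name, the `uint8_t` type, and the upper bound are assumptions made for illustration; they are not taken from the qed driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound, for illustration only. */
#define EXAMPLE_MAX_PRIORITY 7

/* A negative value converted to an unsigned type becomes a large positive one. */
static_assert((uint8_t)-1 == 255, "uint8_t wraps instead of going negative");

/*
 * For an unsigned priority, "priority < 0" can never be true, so only the
 * upper bound needs to be checked.
 */
static bool example_priority_valid(uint8_t priority)
{
	return priority <= EXAMPLE_MAX_PRIORITY;
}
```

Compilers typically flag the removed form of the comparison with a "comparison is always false" style warning when extra warnings are enabled.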
      Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
      Acked-By: Tomer Tayar <Tomer.Tayar@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Christos Gkekas's avatar
      net: ethernet: stmmac: Clean up dead code · c778c321
      Christos Gkekas authored
      Many macros in dwmac-ipq806x are unused and should be removed.
      Moreover gmac->id is an unsigned variable and therefore checking
      whether it is less than zero is redundant.
      Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'ipv6_dev_get_saddr-rcu' · bf6a119e
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      ipv6: ipv6_dev_get_saddr() rcu works
      
      Sending IPv6 udp packets on non connected sockets is quite slow,
      because ipv6_dev_get_saddr() is still using an rwlock and silly
      references games on ifa.
      
      Tested:
      
      $ ./super_netperf 16 -H 4444::555:0786 -l 2000 -t UDP_STREAM -- -m 100 &
      [1] 12527
      
      Performance is boosted from 2.02 Mpps to 4.28 Mpps
      
      Kernel profile before patches :
        22.62%  [kernel]  [k] _raw_read_lock_bh
         7.04%  [kernel]  [k] refcount_sub_and_test
         6.56%  [kernel]  [k] ipv6_get_saddr_eval
         5.67%  [kernel]  [k] _raw_read_unlock_bh
         5.34%  [kernel]  [k] __ipv6_dev_get_saddr
         4.95%  [kernel]  [k] refcount_inc_not_zero
         4.03%  [kernel]  [k] __ip6addrlbl_match
         3.70%  [kernel]  [k] _raw_spin_lock
         3.44%  [kernel]  [k] ipv6_dev_get_saddr
         3.24%  [kernel]  [k] ip6_pol_route
         3.06%  [kernel]  [k] refcount_add_not_zero
         2.30%  [kernel]  [k] __local_bh_enable_ip
         1.81%  [kernel]  [k] mlx4_en_xmit
         1.20%  [kernel]  [k] __ip6_append_data
         1.12%  [kernel]  [k] __ip6_make_skb
         1.11%  [kernel]  [k] __dev_queue_xmit
         1.06%  [kernel]  [k] l3mdev_master_ifindex_rcu
      
      Kernel profile after patches :
        11.36%  [kernel]  [k] ip6_pol_route
         7.65%  [kernel]  [k] _raw_spin_lock
         7.16%  [kernel]  [k] __ipv6_dev_get_saddr
         6.49%  [kernel]  [k] ipv6_get_saddr_eval
         6.04%  [kernel]  [k] refcount_add_not_zero
         3.34%  [kernel]  [k] __ip6addrlbl_match
         2.62%  [kernel]  [k] __dev_queue_xmit
         2.37%  [kernel]  [k] mlx4_en_xmit
         2.26%  [kernel]  [k] dst_release
         1.89%  [kernel]  [k] __ip6_make_skb
         1.87%  [kernel]  [k] __ip6_append_data
         1.86%  [kernel]  [k] udpv6_sendmsg
         1.86%  [kernel]  [k] ip6t_do_table
         1.64%  [kernel]  [k] ipv6_dev_get_saddr
         1.64%  [kernel]  [k] find_match
         1.51%  [kernel]  [k] l3mdev_master_ifindex_rcu
         1.24%  [kernel]  [k] ipv6_addr_label
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      ipv6: avoid cache line dirtying in ipv6_dev_get_saddr() · cc429c8f
      Eric Dumazet authored
      By extending the rcu section a bit, we can avoid these
      very expensive in6_ifa_put()/in6_ifa_hold() calls
      done in __ipv6_dev_get_saddr() and ipv6_dev_get_saddr()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      ipv6: __ipv6_dev_get_saddr() rcu conversion · f59c031e
      Eric Dumazet authored
      Callers hold rcu_read_lock(), so we do not need
      the rcu_read_lock()/rcu_read_unlock() pair.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      24ba333b
    • Eric Dumazet's avatar
      47e26941
    • Eric Dumazet's avatar
      d9bf82c2
    • Eric Dumazet's avatar
      ipv6: prepare RCU lookups for idev->addr_list · 8ef802aa
      Eric Dumazet authored
      inet6_ifa_finish_destroy() already uses kfree_rcu() to free
      inet6_ifaddr structs.
      
      We need to use proper list additions/deletions in order
      to allow readers to use RCU instead of idev->lock rwlock.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'bridge-neigh-msg-proxy-and-flood-suppression-support' · a4231778
      David S. Miller authored
      Roopa Prabhu says:
      
      ====================
      bridge: neigh msg proxy and flood suppression support
      
      This series implements arp and nd suppression in the bridge
      driver for ethernet vpns. It implements rfc7432, section 10
      https://tools.ietf.org/html/rfc7432#section-10
      for ethernet VPN deployments. It is similar to the existing
      BR_PROXYARP* flags but has a few semantic differences to conform
      to EVPN standard. Unlike the existing flags, this new flag suppresses
      flood of all neigh discovery packets (arp and nd) to tunnel ports.
      Supports both vlan filtering and non-vlan filtering bridges.
      
      In case of EVPN, it is mainly used to avoid flooding
      of arp and nd packets to tunnel ports like vxlan.
      
      v2 : rebase to latest + address some optimization feedback from Nikolay.
      v3 : fix kbuild reported build errors with CONFIG_INET off
      v4 : simplify port flag mask as suggested by stephen
      v5 : address some feedback from Toshiaki
      v6 : some v5 cleanups in nd suppress (keep it consistent with arp suppress)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Roopa Prabhu's avatar
      bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports · ed842fae
      Roopa Prabhu authored
      This patch avoids flooding and proxies ndisc packets
      for BR_NEIGH_SUPPRESS ports.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Roopa Prabhu's avatar
      bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports · 057658cb
      Roopa Prabhu authored
      This patch avoids flooding and proxies arp packets
      for BR_NEIGH_SUPPRESS ports.
      
      Moves existing br_do_proxy_arp to br_do_proxy_suppress_arp
      to support both proxy arp and neigh suppress.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Roopa Prabhu's avatar
      bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood · 821f1b21
      Roopa Prabhu authored
      This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
      suppress arp and nd flood on bridge ports. It implements
      rfc7432, section 10.
      https://tools.ietf.org/html/rfc7432#section-10
      for ethernet VPN deployments. It is similar to the existing
      BR_PROXYARP* flags but has a few semantic differences to conform
      to EVPN standard. Unlike the existing flags, this new flag suppresses
      flood of all neigh discovery packets (arp and nd) to tunnel ports.
      Supports both vlan filtering and non-vlan filtering bridges.
      
      In case of EVPN, it is mainly used to avoid flooding
      of arp and nd packets to tunnel ports like vxlan.
      
      This patch adds netlink and sysfs support to set this bridge port
      flag.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      ipv6: fix a BUG in rt6_get_pcpu_route() · 951f788a
      Eric Dumazet authored
      Ido reported following splat and provided a patch.
      
      [  122.221814] BUG: using smp_processor_id() in preemptible [00000000] code: sshd/2672
      [  122.221845] caller is debug_smp_processor_id+0x17/0x20
      [  122.221866] CPU: 0 PID: 2672 Comm: sshd Not tainted 4.14.0-rc3-idosch-next-custom #639
      [  122.221880] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
      [  122.221893] Call Trace:
      [  122.221919]  dump_stack+0xb1/0x10c
      [  122.221946]  ? _atomic_dec_and_lock+0x124/0x124
      [  122.221974]  ? ___ratelimit+0xfe/0x240
      [  122.222020]  check_preemption_disabled+0x173/0x1b0
      [  122.222060]  debug_smp_processor_id+0x17/0x20
      [  122.222083]  ip6_pol_route+0x1482/0x24a0
      ...
      
      I believe we can simplify this code path a bit, since we no longer
      hold a read_lock and need to release it to avoid a deadlock.
      
      By disabling BH, we make sure we'll prevent code re-entry and
      rt6_get_pcpu_route()/rt6_make_pcpu_route() run on the same cpu.
      
      Fixes: 66f5d6ce ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
      Reported-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge tag 'mlx5-updates-2017-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · 51a0c00c
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      Mellanox, mlx5 updates 2017-10-06
      
      This series includes some shared code updates for kernel 4.15 to both
      net-next and rdma-next trees.
      
      The series includes mlx5 low level flow steering updates and optimizations
      to support firmware command parallelism for flow steering requests from
      Maor Gottlieb and two other small fixes from Matan and Maor.
      
      One fix from Matan adds error handling for when the destination
      list of the flow steering rule is full.
      
      Maor introduced a patch to avoid NULL pointer dereference on steering cleanup.
      
      Then come some refactoring patches needed by the series for code sharing purposes,
      which split the Flow Table Entry (FTE) and Flow Group (FG) creation code into two parts:
          1) Object allocation - allocate the steering node and initialize
          its resources.
      
          2) The firmware command execution.
      
      This change will give us the ability to take write lock on the
      parent node (e.g. FG for FTE creating) only on the software data struct allocation
      and creation part of the procedure where the synchronization is really required,
      and will allow us to execute multiple firmware commands simultaneously and overcome the
      firmware bottleneck.
      
      Refactor the locking scheme of the mlx5 core flow steering as follows:
      
      1) Replace the mutex lock with readers-writers semaphore and take
          the write lock only when necessary (e.g. allocating a new flow
          table entry index or adding a node to the parent's children list).
          When we try to find a suitable child in the parent's children list
          (e.g. search for flow group with the same match_criteria of the rule)
          then we only take the read lock.
      
      2) Add versioning mechanism - each steering entity (FT, FG, FTE, DST)
          will have an incremental version. The version is increased when the
          entity is changed (e.g. when a new FTE was added to FG - the FG's
          version is increased).
          Versioning is used in order to determine if the last traverse of an
          entity's children is valid or a rescan under write lock is required.
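
The version check described in point 2 can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names (`example_node`, `example_needs_rescan`); it is not the mlx5 code, and the read/write locking around each step is only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical steering entity: a version counter plus a child count. */
struct example_node {
	unsigned long version;
	int nchildren;
};

/* Reader side: remember the version seen while scanning the children
 * (in a real implementation this would happen under the read lock). */
static unsigned long example_snapshot(const struct example_node *n)
{
	assert(n != NULL);
	return n->version;
}

/* Writer side: any change to the entity bumps its version
 * (in a real implementation this would happen under the write lock). */
static void example_add_child(struct example_node *n)
{
	assert(n != NULL);
	n->nchildren++;
	n->version++;
}

/* After upgrading to the write lock, the earlier scan is only trusted
 * if the version is unchanged; otherwise the children are rescanned. */
static bool example_needs_rescan(const struct example_node *n,
				 unsigned long seen_version)
{
	assert(n != NULL);
	return n->version != seen_version;
}
```

The point of the counter is that the common case (nothing changed between dropping the read lock and taking the write lock) avoids a second traversal.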
      
      The last patch adds a memory pool for FGs and FTEs. It is useful because these objects
      are not small and could be allocated/deallocated many times.
      
      This support improves the insertion rate of steering rules
      from ~5k/sec to ~40k/sec.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Oct, 2017 7 commits
  3. 07 Oct, 2017 17 commits
    • David S. Miller's avatar
      Merge branch 'bpf-obj-name-misc' · c9f766bc
      David S. Miller authored
      Martin KaFai Lau says:
      
      ====================
      bpf: Misc improvements and a new usage on bpf obj name
      
      The first two patches make improvements on the bpf obj name.
      
      The last patch adds the prog name to kallsyms.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Martin KaFai Lau's avatar
      bpf: Append prog->aux->name in bpf_get_prog_name() · 368211fb
      Martin KaFai Lau authored
      This patch makes the bpf_prog's name available
      in kallsyms.
      
      The new format is bpf_prog_tag[_name].
      
      Sample kallsyms from running selftests/bpf/test_progs:
      [root@arch-fb-vm1 ~]# egrep ' bpf_prog_[0-9a-fA-F]{16}' /proc/kallsyms
      ffffffffa0048000 t bpf_prog_dabf0207d1992486_test_obj_id
      ffffffffa0038000 t bpf_prog_a04f5eef06a7f555__123456789ABCDE
      ffffffffa0050000 t bpf_prog_a04f5eef06a7f555
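
The `bpf_prog_tag[_name]` format can be sketched in userspace C as below. This is an illustration only, not the kernel's bpf_get_prog_name(); the function name, the 8-byte tag size, and the buffer handling are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed tag size: 8 bytes, printed as 16 hex digits. */
#define EXAMPLE_TAG_SIZE 8

/* Build "bpf_prog_<tag>" and append "_<name>" only when a name is set.
 * On truncation the buffer holds whatever fit, NUL-terminated. */
static void example_prog_sym(char *buf, size_t len,
			     const uint8_t tag[EXAMPLE_TAG_SIZE],
			     const char *name)
{
	size_t off;
	int n;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	n = snprintf(buf, len, "bpf_prog_");
	if (n < 0 || (size_t)n >= len)
		return;
	off = (size_t)n;

	for (int i = 0; i < EXAMPLE_TAG_SIZE; i++) {
		n = snprintf(buf + off, len - off, "%02x", tag[i]);
		if (n < 0 || (size_t)n >= len - off)
			return;
		off += (size_t)n;
	}

	/* The name is optional, hence the brackets in the format. */
	if (name != NULL && name[0] != '\0')
		snprintf(buf + off, len - off, "_%s", name);
}
```

With the tag bytes `da bf 02 07 d1 99 24 86` and the name `test_obj_id`, the sketch produces the same string as the first sample line above.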
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Martin KaFai Lau's avatar
      bpf: Use char in prog and map name · 067cae47
      Martin KaFai Lau authored
      Instead of u8, use char for prog and map names.  This avoids
      compiler signedness warnings in userspace tools.  The
      bpf_prog_aux, bpf_map, bpf_attr, bpf_prog_info and
      bpf_map_info are changed.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Martin KaFai Lau's avatar
      bpf: Change bpf_obj_name_cpy() to better ensure map's name is init by 0 · 473d9734
      Martin KaFai Lau authored
      During get_info_by_fd, the prog/map name is memcpy-ed.  It depends
      on the prog->aux->name and map->name to be zero initialized.
      
      bpf_prog_aux is easy to guarantee that aux->name is zero init.
      
      The name in bpf_map may be harder to be guaranteed in the future when
      new map type is added.
      
      Hence, this patch makes bpf_obj_name_cpy() always zero-initialize
      the prog/map name.
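
The idea of always zeroing the destination before copying can be sketched as follows. This is a userspace illustration with hypothetical names and an assumed 16-byte name buffer; it is not the kernel's bpf_obj_name_cpy() and omits any character validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed fixed size of the name buffer, including the trailing NUL. */
#define EXAMPLE_NAME_LEN 16

/*
 * Zero the whole destination first, so that a later memcpy() of the full
 * buffer never exposes stale bytes, then copy the name.
 * Returns 0 on success, -1 if src does not fit.
 */
static int example_name_cpy(char dst[EXAMPLE_NAME_LEN], const char *src)
{
	size_t n;

	assert(dst != NULL && src != NULL);
	memset(dst, 0, EXAMPLE_NAME_LEN);

	n = strlen(src);
	if (n >= EXAMPLE_NAME_LEN)
		return -1;

	memcpy(dst, src, n);
	return 0;
}
```

Because the copy helper itself clears the buffer, correctness no longer depends on every allocation site handing over zeroed memory.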
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • William Tu's avatar
      ip_gre: check packet length and mtu correctly in erspan tx · f192970d
      William Tu authored
      Similar to the earlier patch for erspan_xmit(), for an ARPHRD_ETHER device
      skb->len is the length of the whole Ethernet packet, so
      dev->hard_header_len should be subtracted from skb->len.
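
The length comparison described here can be sketched as plain arithmetic. The function and parameter names below are hypothetical stand-ins for skb->len, dev->hard_header_len, and the tunnel MTU; this is not the ip_gre code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The frame length includes the link-layer header, but the MTU does not,
 * so the header length is subtracted before comparing.
 * Returns true when the payload is too large for the MTU.
 */
static bool example_exceeds_mtu(unsigned int frame_len,
				unsigned int hard_header_len,
				unsigned int mtu)
{
	unsigned int payload;

	/* Guard against underflow on malformed short frames. */
	payload = frame_len > hard_header_len ?
		  frame_len - hard_header_len : 0;

	return payload > mtu;
}
```

For example, a 1514-byte Ethernet frame with a 14-byte header carries a 1500-byte payload, which fits a 1500-byte MTU; comparing the raw 1514 against the MTU would wrongly reject it.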
      
      Fixes: 1a66a836 ("gre: add collect_md mode to ERSPAN tunnel")
      Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
      Signed-off-by: William Tu <u9012063@gmail.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Lin Zhang's avatar
      net: phonet: mark phonet_protocol as const · 548ec114
      Lin Zhang authored
      The phonet_protocol structs don't need to be written by anyone and
      so can be marked as const.
      Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Lin Zhang's avatar
      net: phonet: mark header_ops as const · 64237470
      Lin Zhang authored
      Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'bpf-perf-time-helpers' · a1d753d2
      David S. Miller authored
      Yonghong Song says:
      
      ====================
      bpf: add two helpers to read perf event enabled/running time
      
      Hardware pmu counters are limited resources. When there are more
      pmu based perf events opened than available counters, kernel will
      multiplex these events so each event gets certain percentage
      (but not 100%) of the pmu time. When multiplexing happens,
      the number of samples or the counter value will not match what
      would be observed without multiplexing. This makes comparison between
      different runs difficult.
      
      Typically, the number of samples or counter value should be
      normalized before comparing to other experiments. The typical
      normalization is done like:
        normalized_num_samples = num_samples * time_enabled / time_running
        normalized_counter_value = counter_value * time_enabled / time_running
      where time_enabled is the time enabled for event and time_running is
      the time running for event since last normalization.
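
The normalization formula above can be written as a small helper. This is a userspace sketch with a hypothetical function name; using floating point for the intermediate product and returning 0 when the event never ran are choices made for the illustration, not something the patch set prescribes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * normalized = value * time_enabled / time_running
 *
 * A double is used for the intermediate product to sidestep 64-bit
 * overflow; this trades a little precision for simplicity.
 */
static uint64_t example_normalize(uint64_t value, uint64_t time_enabled,
				  uint64_t time_running)
{
	/* The event never got PMU time: nothing to scale. */
	if (time_running == 0)
		return 0;

	return (uint64_t)((double)value * (double)time_enabled /
			  (double)time_running);
}
```

An event that was enabled for 200 time units but only ran for 100 has its raw count doubled; an event that ran the whole time is returned unchanged.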
      
      This patch set implements two helper functions.
      The helper bpf_perf_event_read_value reads counter/time_enabled/time_running
      for perf event array map. The helper bpf_perf_prog_read_value read
      counter/time_enabled/time_running for bpf prog with type BPF_PROG_TYPE_PERF_EVENT.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Yonghong Song's avatar
      bpf: add a test case for helper bpf_perf_prog_read_value · 81b9cf80
      Yonghong Song authored
      The bpf sample program trace_event is enhanced to use the new
      helper to print out enabled/running time.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Yonghong Song's avatar
      bpf: add helper bpf_perf_prog_read_value · 4bebdc7a
      Yonghong Song authored
      This patch adds helper bpf_perf_prog_read_value for perf event based bpf
      programs, to read event counter and enabled/running time.
      The enabled/running time is accumulated since the perf event open.
      
      The typical use case for perf event based bpf program is to attach itself
      to a single event. In such cases, if it is desirable to get scaling factor
      between two bpf invocations, users can save the time values in a map,
      and use the value from the map and the current value to calculate
      the scaling factor.
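
Because the times are accumulated since the event was opened, a per-interval scaling needs the previous reading. The sketch below shows that arithmetic in plain C; the struct and function names are hypothetical, and a real bpf program would keep the previous reading in a map rather than in a local variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of one reading: counter plus accumulated times. */
struct example_perf_value {
	uint64_t counter;
	uint64_t enabled;
	uint64_t running;
};

/*
 * Scale the counter delta between two readings by the ratio of the
 * enabled-time delta to the running-time delta for the same interval.
 * Returns 0 if the event did not run during the interval.
 */
static uint64_t example_scaled_delta(const struct example_perf_value *prev,
				     const struct example_perf_value *cur)
{
	uint64_t d_counter, d_enabled, d_running;

	assert(prev != NULL && cur != NULL);

	d_counter = cur->counter - prev->counter;
	d_enabled = cur->enabled - prev->enabled;
	d_running = cur->running - prev->running;

	if (d_running == 0)
		return 0;

	return (uint64_t)((double)d_counter * (double)d_enabled /
			  (double)d_running);
}
```

For instance, if the counter advanced by 200 while the event was enabled for 1000 time units but running for only 500, the scaled delta is 400.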
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Yonghong Song's avatar
      bpf: add a test case for helper bpf_perf_event_read_value · 020a32d9
      Yonghong Song authored
      The bpf sample program tracex6 is enhanced to use the new
      helper to read enabled/running time as well.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Yonghong Song's avatar
      bpf: add helper bpf_perf_event_read_value for perf event array map · 908432ca
      Yonghong Song authored
      Hardware pmu counters are limited resources. When there are more
      pmu based perf events opened than available counters, kernel will
      multiplex these events so each event gets certain percentage
      (but not 100%) of the pmu time. When multiplexing happens,
      the number of samples or the counter value will not match what
      would be observed without multiplexing. This makes comparison between
      different runs difficult.
      
      Typically, the number of samples or counter value should be
      normalized before comparing to other experiments. The typical
      normalization is done like:
        normalized_num_samples = num_samples * time_enabled / time_running
        normalized_counter_value = counter_value * time_enabled / time_running
      where time_enabled is the time enabled for event and time_running is
      the time running for event since last normalization.
      
      This patch adds helper bpf_perf_event_read_value for kprobe-based perf
      event array map, to read perf counter and enabled/running time.
      The enabled/running time is accumulated since the perf event open.
      To obtain the scaling factor between two bpf invocations, users
      can use cpu_id as the key (which is typical for perf array usage model)
      to remember the previous value and do the calculation inside the
      bpf program.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Yonghong Song's avatar
      bpf: perf event change needed for subsequent bpf helpers · 97562633
      Yonghong Song authored
      This patch does not impact existing functionalities.
      It contains the changes in perf event area needed for
      subsequent bpf_perf_event_read_value and
      bpf_perf_prog_read_value helpers.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Amine Kherbouche's avatar
      ip_tunnel: add mpls over gre support · bdc47641
      Amine Kherbouche authored
      This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
      API by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new
      tunnel type TUNNEL_ENCAP_MPLS.
      Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'fib6-rcu' · 2af48d43
      David S. Miller authored
      Wei Wang says:
      
      ====================
      ipv6: replace rwlock with rcu and spinlock in fib6 table
      
      Currently, fib6 table is protected by rwlock. During route lookup,
      reader lock is taken and during route insertion, deletion or
      modification, writer lock is taken. This is a very inefficient
      implementation because the fastpath always has to do the operation
      to grab the reader lock.
      According to my latest syn flood test on an iota ivybridge machine
      with 2 10G mlx nics bonded together, each with 8 rx queues on 2 NUMA
      nodes, and with the upstream net-next kernel:
      ipv4 stack can handle around 4.2Mpps
      ipv6 stack can handle around 1.3Mpps
      
      In order to close the gap of the performance number between ipv4
      and ipv6 stack, this patch series tries to get rid of the usage of
      the rwlock and replace it with rcu and spinlock protection. This will
      greatly speed up the fastpath performance as it only needs to hold
      rcu which is much less expensive than grabbing the reader lock. It
      also makes ipv6 fib implementation more consistent with ipv4.
      
      In order to be able to replace the current rwlock with rcu and
      spinlock, some preparation work is needed:
      Patch 1-8 introduces a per-route hash table (protected by rcu and a
      different spinlock) to store all cached routes created by pmtu and ip
      redirect under its main route. This makes the main fib6 tree only
      contain static routes.
      Patch 9-14 prepares all the reader path to be ready to tolerate
      concurrent writer.
      Patch 15 finally does the rwlock to rcu and spinlock conversion.
      Patch 16 takes care of rt6_stats.
      
      After this patch series, in the same syn flood test,
      ipv6 stack can now handle around 3.5Mpps compared to previous 1.3Mpps
      in my test setup.
      
      After this patch series, there are still some improvements that should
      be done in ipv6 stack:
      1. During route lookup, dst_use() is called every time on the selected
      route to update dst->__use and dst->lastuse. This dirties the cacheline
      and causes extra cacheline miss and should be avoided.
      2. When no route is found in the current table, net->ipv6.ip6_null_entry
      is used and refcnt is taken on it. As there is no pcpu cache for this
      specific route, frequent change on the refcnt for this route causes
      quite some cacheline misses.
      And to make things worse, if CONFIG_IPV6_MULTIPLE_TABLES is defined,
      output path route lookup always starts with local table first and
      guarantees to hit net->ipv6.ip6_null_entry before continuing to do
      lookup in the main table.
      These operations on net->ipv6.ip6_null_entry could potentially be
      avoided.
      3. ipv6 input path route lookup grabs refcnt on dst. This is different
      from ipv4. We could potentially change this behavior to let ipv6 input
      path route lookup not to grab refcnt on dst. However, it does not give
      us much performance boost as we currently have pcpu route cache for
      input path as well in ipv6. But this work probably is still worth doing
      to unify ipv6 and ipv4 route lookup behavior.
      
      The above issues will be addressed separately after this patch series
      has been accepted.
      
      This is a joint work with Martin KaFai Lau and Eric Dumazet. And many
      many thanks to them for their inspiring ideas and big big code review
      efforts.
      ====================
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Wei Wang's avatar
      ipv6: take care of rt6_stats · 81eb8447
      Wei Wang authored
      Currently, most of the rt6_stats are not hooked up correctly. As the
      last part of this patch series, hook up all existing rt6_stats and add
      one new stat fib_rt_uncache to indicate the number of routes in the
      uncached list.
      For details of the stats, please refer to the comments added in
      include/net/ip6_fib.h.
      
      Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified
      under a lock. So atomic_t is used for them.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Wei Wang's avatar
      ipv6: replace rwlock with rcu and spinlock in fib6_table · 66f5d6ce
      Wei Wang authored
      With all the preparation work before, we are now ready to replace rwlock
      with rcu and spinlock in fib6_table.
      That means now all fib6_node in fib6_table are protected by rcu. And
      when freeing fib6_node, call_rcu() is used to wait for the rcu grace
      period before releasing the memory.
      When accessing fib6_node, corresponding rcu APIs need to be used.
      And all previous sections protected by the write lock will now be
      protected by the spin lock per table.
      All previous sections protected by read lock will now be protected by
      rcu_read_lock().
      
      A couple of things to note here:
      1. As part of the work of replacing rwlock with rcu, the linked list of
      fn->leaf now has to be rcu protected as well. So both fn->leaf and
      rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
      used when manipulating them.
      
      2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
      and is tagged with __rcu and rcu APIs are used in corresponding places.
      Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
      thread. This makes the issue a bit complicated. We think a valid
      solution for it is to let rt6_select() grab the tb6_lock if it decides
      to change it. As it is not in the normal operation and only happens when
      there is no valid neighbor cache for the route, we think the performance
      impact should be low.
      
      3. fib6_walk_continue() has to be called with tb6_lock held even in the
      route dumping related functions, e.g. inet6_dump_fib(),
      fib6_tables_dump() and ipv6_route_seq_ops. It is because
      fib6_walk_continue() makes modifications to the walker structure, and so
      are fib6_repair_tree() and fib6_del_route(). In order to do proper
      syncing between them, we need to let fib6_walk_continue() hold the lock.
      We may be able to do further improvement on the way we do the tree walk
      to get rid of the need for holding the spin lock. But not for now.
      
      4. When fib6_del_route() removes a route from the tree, we no longer
      set rt->dst.rt6_next to NULL, so that concurrent readers can keep
      traversing the list under rcu. However, rt->dst.rt6_next is only
      valid within this same rcu period. No one should access it later.
      
      5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be
      performed before we publish this route (either by linking it to fn->leaf
      or insert it in the list pointed by fn->leaf) just to be safe because as
      soon as we publish the route, some read thread will be able to access it.
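
The ordering in point 5 (take the reference first, publish second) can be illustrated in userspace with C11 atomics. This is only an analogy under stated assumptions: the struct and function names are hypothetical, and release/acquire atomics stand in for the kernel's RCU pointer publication and dereference; it is not the fib6 code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical route object with a reference count. */
struct example_route {
	atomic_int ref;
	int metric;
};

/* Hypothetical published pointer, standing in for fn->leaf. */
static _Atomic(struct example_route *) example_leaf;

/*
 * Take the reference before the pointer becomes visible: as soon as the
 * store below completes, a concurrent reader may find the route, and it
 * must never observe an object whose count has not been raised yet.
 */
static void example_publish(struct example_route *rt)
{
	assert(rt != NULL);
	atomic_fetch_add(&rt->ref, 1);
	atomic_store_explicit(&example_leaf, rt, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static struct example_route *example_lookup(void)
{
	return atomic_load_explicit(&example_leaf, memory_order_acquire);
}
```

Doing the increment after the store would leave a window in which a reader could see the route with a stale count, which is what the commit avoids "just to be safe".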
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>