1. 07 Oct, 2017 30 commits
    • bpf: Use char in prog and map name · 067cae47
      Martin KaFai Lau authored
      Instead of u8, use char for prog and map names.  This avoids
      compiler signedness warnings in userspace tools.  The
      bpf_prog_aux, bpf_map, bpf_attr, bpf_prog_info and
      bpf_map_info structs are changed.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Change bpf_obj_name_cpy() to better ensure map's name is init by 0 · 473d9734
      Martin KaFai Lau authored
      During get_info_by_fd, the prog/map name is memcpy-ed.  It depends
      on the prog->aux->name and map->name to be zero initialized.
      
      bpf_prog_aux is easy to guarantee that aux->name is zero init.
      
      The name in bpf_map may be harder to be guaranteed in the future when
      new map type is added.
      
      Hence, this patch makes bpf_obj_name_cpy() always zero-initialize
      the prog/map name.
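
      The zero-init-then-copy idea can be sketched in userspace C. This is
      an illustration only, not the kernel's bpf_obj_name_cpy(): the name
      obj_name_cpy, the OBJ_NAME_LEN value and the -1 error return are
      assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OBJ_NAME_LEN 16 /* assumed buffer size for this sketch */

/*
 * Copy src into dst. The destination is zeroed first, so bytes past the
 * name never carry stale data even if the caller did not clear dst.
 * Returns 0 on success, -1 if src does not fit with its terminating NUL.
 */
static int obj_name_cpy(char dst[OBJ_NAME_LEN], const char *src)
{
    size_t len = 0;

    /* Zero the whole destination, whatever the caller left in it. */
    memset(dst, 0, OBJ_NAME_LEN);

    while (len < OBJ_NAME_LEN && src[len] != '\0')
        len++;
    if (len == OBJ_NAME_LEN)
        return -1; /* no room for the terminating NUL */

    memcpy(dst, src, len);
    return 0;
}

/* Usage: stale bytes in the buffer are gone after the copy. */
void obj_name_cpy_example(void)
{
    char name[OBJ_NAME_LEN];
    size_t i;

    memset(name, 'x', sizeof(name)); /* simulate an uninitialized buffer */
    assert(obj_name_cpy(name, "map0") == 0);
    assert(strcmp(name, "map0") == 0);
    for (i = 4; i < OBJ_NAME_LEN; i++)
        assert(name[i] == '\0');
}
```

      Because the whole buffer is cleared inside the copy routine, a later
      memcpy of the full buffer does not depend on how the object was
      allocated.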
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_gre: check packet length and mtu correctly in erspan tx · f192970d
      William Tu authored
      Similar to the earlier patch for erspan_xmit(), for an ARPHRD_ETHER
      device skb->len is the length of the whole Ethernet packet, so
      dev->hard_header_len should be subtracted from skb->len before
      comparing against the mtu.
      
      Fixes: 1a66a836 ("gre: add collect_md mode to ERSPAN tunnel")
      Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
      Signed-off-by: William Tu <u9012063@gmail.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phonet: mark phonet_protocol as const · 548ec114
      Lin Zhang authored
      The phonet_protocol structs don't need to be written by anyone and
      so can be marked as const.
      Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phonet: mark header_ops as const · 64237470
      Lin Zhang authored
      Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-perf-time-helpers' · a1d753d2
      David S. Miller authored
      Yonghong Song says:
      
      ====================
      bpf: add two helpers to read perf event enabled/running time
      
      Hardware pmu counters are limited resources. When there are more
      pmu based perf events opened than available counters, kernel will
      multiplex these events so each event gets certain percentage
      (but not 100%) of the pmu time. In case that multiplexing happens,
      the number of samples or counter value will not reflect the
      case compared to no multiplexing. This makes comparison between
      different runs difficult.
      
      Typically, the number of samples or counter value should be
      normalized before comparing to other experiments. The typical
      normalization is done like:
        normalized_num_samples = num_samples * time_enabled / time_running
        normalized_counter_value = counter_value * time_enabled / time_running
      where time_enabled is the time enabled for event and time_running is
      the time running for event since last normalization.
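
      The normalization above can be sketched in plain C. This is a
      userspace illustration, not code from the patch set; the function
      name normalize_count and the choice to return 0 when time_running
      is 0 are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a raw sample count or counter value by time_enabled / time_running.
 * Returns 0 if the event never ran (assumed convention, avoids dividing by 0).
 */
static uint64_t normalize_count(uint64_t raw, uint64_t time_enabled,
                                uint64_t time_running)
{
    if (time_running == 0)
        return 0;
    /* double avoids 64-bit overflow in raw * time_enabled, at the cost of
     * rounding for very large values. */
    return (uint64_t)((double)raw * (double)time_enabled /
                      (double)time_running);
}

/* Usage: an event that ran for half of its enabled time is scaled by 2. */
void normalize_count_example(void)
{
    assert(normalize_count(1000, 200, 100) == 2000);
    assert(normalize_count(1000, 100, 100) == 1000);
}
```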
      
      This patch set implements two helper functions.
      The helper bpf_perf_event_read_value reads counter/time_enabled/time_running
      for perf event array map. The helper bpf_perf_prog_read_value reads
      counter/time_enabled/time_running for bpf prog with type BPF_PROG_TYPE_PERF_EVENT.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add a test case for helper bpf_perf_prog_read_value · 81b9cf80
      Yonghong Song authored
      The bpf sample program trace_event is enhanced to use the new
      helper to print out enabled/running time.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add helper bpf_perf_prog_read_value · 4bebdc7a
      Yonghong Song authored
      This patch adds helper bpf_perf_prog_read_value for perf event based bpf
      programs, to read event counter and enabled/running time.
      The enabled/running time is accumulated since the perf event open.
      
      The typical use case for perf event based bpf program is to attach itself
      to a single event. In such cases, if it is desirable to get the scaling factor
      between two bpf invocations, users can save the time values in a map,
      and use the value from the map and the current value to calculate
      the scaling factor.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add a test case for helper bpf_perf_event_read_value · 020a32d9
      Yonghong Song authored
      The bpf sample program tracex6 is enhanced to use the new
      helper to read enabled/running time as well.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add helper bpf_perf_event_read_value for perf event array map · 908432ca
      Yonghong Song authored
      Hardware pmu counters are limited resources. When there are more
      pmu based perf events opened than available counters, kernel will
      multiplex these events so each event gets certain percentage
      (but not 100%) of the pmu time. In case that multiplexing happens,
      the number of samples or counter value will not reflect the
      case compared to no multiplexing. This makes comparison between
      different runs difficult.
      
      Typically, the number of samples or counter value should be
      normalized before comparing to other experiments. The typical
      normalization is done like:
        normalized_num_samples = num_samples * time_enabled / time_running
        normalized_counter_value = counter_value * time_enabled / time_running
      where time_enabled is the time enabled for event and time_running is
      the time running for event since last normalization.
      
      This patch adds helper bpf_perf_event_read_value for kprobe-based perf
      event array map, to read perf counter and enabled/running time.
      The enabled/running time is accumulated since the perf event open.
      To achieve scaling factor between two bpf invocations, users
      can use cpu_id as the key (which is typical for perf array usage model)
      to remember the previous value and do the calculation inside the
      bpf program.
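
      Since the times are cumulative, a scaling factor between two
      invocations comes from the differences of two readings. A minimal
      userspace sketch follows, assuming the previous reading was saved
      somewhere such as a per-cpu map slot; the struct perf_snapshot and
      the function scaled_delta are hypothetical names for this example,
      not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One cumulative reading: counter value plus enabled/running time.
 * The struct name and layout are assumptions for this sketch. */
struct perf_snapshot {
    uint64_t counter;
    uint64_t enabled;
    uint64_t running;
};

/*
 * Estimate the counter increase between two readings, scaled by
 * delta_enabled / delta_running to compensate for multiplexing.
 * Assumes cur was taken after prev, so every field is >= its prev value.
 */
static uint64_t scaled_delta(const struct perf_snapshot *prev,
                             const struct perf_snapshot *cur)
{
    uint64_t d_counter = cur->counter - prev->counter;
    uint64_t d_enabled = cur->enabled - prev->enabled;
    uint64_t d_running = cur->running - prev->running;

    if (d_running == 0)
        return 0; /* the event did not run in this interval */
    return (uint64_t)((double)d_counter * (double)d_enabled /
                      (double)d_running);
}

/* Usage: the event ran for half of the interval, so 200 becomes 400. */
void scaled_delta_example(void)
{
    struct perf_snapshot prev = { 100, 1000, 500 };
    struct perf_snapshot cur = { 300, 3000, 1500 };

    assert(scaled_delta(&prev, &cur) == 400);
}
```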
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: perf event change needed for subsequent bpf helpers · 97562633
      Yonghong Song authored
      This patch does not impact existing functionalities.
      It contains the changes in perf event area needed for
      subsequent bpf_perf_event_read_value and
      bpf_perf_prog_read_value helpers.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: add mpls over gre support · bdc47641
      Amine Kherbouche authored
      This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
      API by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new
      tunnel type TUNNEL_ENCAP_MPLS.
      Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'fib6-rcu' · 2af48d43
      David S. Miller authored
      Wei Wang says:
      
      ====================
      ipv6: replace rwlock with rcu and spinlock in fib6 table
      
      Currently, fib6 table is protected by rwlock. During route lookup,
      reader lock is taken and during route insertion, deletion or
      modification, writer lock is taken. This is a very inefficient
      implementation because the fastpath always has to do the operation
      to grab the reader lock.
      According to my latest syn flood test on an iota ivybridge machine
      with 2 10G mlx nics bonded together, each with 8 rx queues on 2 NUMA
      nodes, and with the upstream net-next kernel:
      ipv4 stack can handle around 4.2Mpps
      ipv6 stack can handle around 1.3Mpps
      
      In order to close the gap of the performance number between ipv4
      and ipv6 stack, this patch series tries to get rid of the usage of
      the rwlock and replace it with rcu and spinlock protection. This will
      greatly speed up the fastpath performance as it only needs to hold
      rcu which is much less expensive than grabbing the reader lock. It
      also makes ipv6 fib implementation more consistent with ipv4.
      
      In order to be able to replace the current rwlock with rcu and
      spinlock, some preparation work is needed:
      Patch 1-8 introduces a per-route hash table (protected by rcu and a
      different spinlock) to store all cached routes created by pmtu and ip
      redirect under its main route. This makes the main fib6 tree only
      contain static routes.
      Patch 9-14 prepares all the reader path to be ready to tolerate
      concurrent writer.
      Patch 15 finally does the rwlock to rcu and spinlock conversion.
      Patch 16 takes care of rt6_stats.
      
      After this patch series, in the same syn flood test,
      ipv6 stack can now handle around 3.5Mpps compared to previous 1.3Mpps
      in my test setup.
      
      After this patch series, there are still some improvements that should
      be done in ipv6 stack:
      1. During route lookup, dst_use() is called every time on the selected
      route to update dst->__use and dst->lastuse. This dirties the cacheline
      and causes extra cacheline miss and should be avoided.
      2. when no route is found in the current table, net->ipv6.ip6_null_entry
      is used and refcnt is taken on it. As there is no pcpu cache for this
      specific route, frequent change on the refcnt for this route causes
      quite some cacheline misses.
      And to make things worse, if CONFIG_IPV6_MULTIPLE_TABLES is defined,
      output path route lookup always starts with local table first and
      guarantees to hit net->ipv6.ip6_null_entry before continuing to do
      lookup in the main table.
      These operations on net->ipv6.ip6_null_entry could potentially be
      avoided.
      3. ipv6 input path route lookup grabs refcnt on dst. This is different
      from ipv4. We could potentially change this behavior to let ipv6 input
      path route lookup not to grab refcnt on dst. However, it does not give
      us much performance boost as we currently have pcpu route cache for
      input path as well in ipv6. But this work probably is still worth doing
      to unify ipv6 and ipv4 route lookup behavior.
      
      The above issues will be addressed separately after this patch series
      has been accepted.
      
      This is a joint work with Martin KaFai Lau and Eric Dumazet. And many
      many thanks to them for their inspiring ideas and big big code review
      efforts.
      ====================
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: take care of rt6_stats · 81eb8447
      Wei Wang authored
      Currently, most of the rt6_stats are not hooked up correctly. As the
      last part of this patch series, hook up all existing rt6_stats and add
      one new stat fib_rt_uncache to indicate the number of routes in the
      uncached list.
      For details of the stats, please refer to the comments added in
      include/net/ip6_fib.h.
      
      Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified
      under a lock. So atomic_t is used for them.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: replace rwlock with rcu and spinlock in fib6_table · 66f5d6ce
      Wei Wang authored
      With all the preparation work before, we are now ready to replace rwlock
      with rcu and spinlock in fib6_table.
      That means now all fib6_node in fib6_table are protected by rcu. And
      when freeing fib6_node, call_rcu() is used to wait for the rcu grace
      period before releasing the memory.
      When accessing fib6_node, corresponding rcu APIs need to be used.
      And all previous sections protected by the write lock will now be
      protected by the spin lock per table.
      All previous sections protected by read lock will now be protected by
      rcu_read_lock().
      
      A couple of things to note here:
      1. As part of the work of replacing rwlock with rcu, the linked list of
      fn->leaf now has to be rcu protected as well. So both fn->leaf and
      rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
      used when manipulating them.
      
      2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
      and is tagged with __rcu and rcu APIs are used in corresponding places.
      Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
      thread. This makes the issue a bit complicated. We think a valid
      solution for it is to let rt6_select() grab the tb6_lock if it decides
      to change it. As it is not in the normal operation and only happens when
      there is no valid neighbor cache for the route, we think the performance
      impact should be low.
      
      3. fib6_walk_continue() has to be called with tb6_lock held even in the
      route dumping related functions, e.g. inet6_dump_fib(),
      fib6_tables_dump() and ipv6_route_seq_ops. It is because
      fib6_walk_continue() makes modifications to the walker structure, and so
      are fib6_repair_tree() and fib6_del_route(). In order to do proper
      syncing between them, we need to let fib6_walk_continue() hold the lock.
      We may be able to do further improvement on the way we do the tree walk
      to get rid of the need for holding the spin lock. But not for now.
      
      4. When fib6_del_route() removes a route from the tree, we no longer
      set rt->dst.rt6_next to NULL, so that a simultaneous reader can
      further traverse the list with rcu. However, rt->dst.rt6_next is only
      valid within this same rcu period. No one should access it later.
      
      5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be
      performed before we publish this route (either by linking it to fn->leaf
      or insert it in the list pointed by fn->leaf) just to be safe because as
      soon as we publish the route, some read thread will be able to access it.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: add key length check into rt6_select() · 17ecf590
      Wei Wang authored
      After rwlock is replaced with rcu and spinlock, fib6_lookup() could
      potentially return an intermediate node if other thread is doing
      fib6_del() on a route which is the only route on the node so that
      fib6_repair_tree() will be called on this node and potentially assigns
      fn->leaf to its child's fn->leaf.
      
      In order to detect this situation in rt6_select(), we have to check if
      fn->fn_bit is consistent with the key length stored in the route. And
      depending on if the fn is in the subtree or not, the key is either
      rt->rt6i_dst or rt->rt6i_src.
      If any inconsistency is found, that means the node no longer holds valid
      routes in it. So net->ipv6.ip6_null_entry is returned.
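
      The consistency check described above can be sketched as a small
      predicate. The struct layouts below are simplified stand-ins and
      the names node_holds_route and in_subtree are hypothetical; only
      the comparison of fn_bit against the key's prefix length comes
      from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the route keys and the tree node. */
struct route_key {
    int plen; /* prefix length of the key */
};

struct route {
    struct route_key dst; /* plays the role of rt->rt6i_dst */
    struct route_key src; /* plays the role of rt->rt6i_src */
};

struct node {
    int fn_bit;      /* bit position this node represents */
    bool in_subtree; /* assumed flag: node lives in a source subtree */
};

/*
 * A node and the route found on its leaf are consistent only if the
 * node's fn_bit equals the prefix length of the relevant key.
 */
static bool node_holds_route(const struct node *fn, const struct route *rt)
{
    int key_plen = fn->in_subtree ? rt->src.plen : rt->dst.plen;

    return fn->fn_bit == key_plen;
}

/* Usage: a /64 route seen through a /48 intermediate node is rejected. */
void node_holds_route_example(void)
{
    struct route rt = { { 64 }, { 0 } };
    struct node exact = { 64, false };
    struct node intermediate = { 48, false };

    assert(node_holds_route(&exact, &rt));
    assert(!node_holds_route(&intermediate, &rt));
}
```

      A caller that gets false would fall back to the null entry, as the
      message describes.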
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: check fn->leaf before it is used · 8d1040e8
      Wei Wang authored
      If rwlock is replaced with rcu and spinlock, it is possible that the
      reader thread will see fn->leaf as NULL in the following scenarios:
      1. fib6_add() is in progress and we have already inserted a new node but
      not yet inserted the route.
      2. fib6_del_route() is in progress and we have already set fn->leaf to
      NULL but not yet freed the node because of rcu grace period.
      
      This patch makes sure all the reader threads check fn->leaf first before
      using it. And together with later patch to grab rcu_read_lock() and
      rcu_dereference() fn->leaf, it makes sure reader threads are safe when
      accessing fn->leaf.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: update fn_sernum after route is inserted to tree · bbd63f06
      Wei Wang authored
      fib6_add() logic currently calls fib6_add_1() to figure out what node
      should be used for the newly added route and then call
      fib6_add_rt2node() to insert the route to the node.
      And during the call of fib6_add_1(), fn_sernum is updated for all nodes
      that share the same prefix as the new route.
      This does not have issue in the current code because reader thread will
      not be able to access the tree while writer thread is inserting new
      route to it. However, it is not the case once we transition to use RCU.
      Reader thread could potentially see the new fn_sernum before the new
      route is inserted. As a result, reader thread's route lookup will return
      a stale route with the new fn_sernum.
      
      In order to solve this issue, we remove all the update of fn_sernum in
      fib6_add_1(), and instead, introduce a new function that updates fn_sernum
      for all related nodes and call this function once the route is
      successfully inserted to the tree.
      Also, smp_wmb() is used after a route is successfully inserted into the
      fib tree and right before the update of fn->sernum. And smp_rmb() is
      used right after fn->sernum is accessed in rt6_get_cookie_safe(). This
      is to guarantee that when the reader thread sees the new fn->sernum, the
      new route is already inserted in the tree in memory.
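
      The ordering requirement can be illustrated in userspace with C11
      atomics. This is an analogy rather than the kernel code: the
      release and acquire fences stand in for smp_wmb() and smp_rmb(),
      and struct node, writer_insert and reader_lookup are names made up
      for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified node: one route pointer and one serial number. */
struct node {
    _Atomic(int *) leaf; /* stands in for the route on the node */
    atomic_uint sernum;  /* stands in for fn->fn_sernum */
};

/* Writer: make the route visible first, then bump the serial number. */
static void writer_insert(struct node *n, int *route, unsigned int sernum)
{
    atomic_store_explicit(&n->leaf, route, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* role of smp_wmb() */
    atomic_store_explicit(&n->sernum, sernum, memory_order_relaxed);
}

/* Reader: read the serial number first, then the route. */
static int *reader_lookup(struct node *n, unsigned int *sernum_out)
{
    *sernum_out = atomic_load_explicit(&n->sernum, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire); /* role of smp_rmb() */
    return atomic_load_explicit(&n->leaf, memory_order_relaxed);
}

/* Usage (single-threaded): a reader that sees sernum 7 also sees the route. */
void sernum_order_example(void)
{
    struct node n;
    int route = 42;
    unsigned int seen;
    int *rt;

    atomic_init(&n.leaf, NULL);
    atomic_init(&n.sernum, 0);

    writer_insert(&n, &route, 7);
    rt = reader_lookup(&n, &seen);
    assert(seen == 7);
    assert(rt == &route);
}
```

      With the two fences paired this way, a reader that observes the new
      serial number is guaranteed to observe the route stored before it.
      Without them, either side could be reordered and the reader could
      pair a new serial number with the old tree contents.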
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: replace dst_hold() with dst_hold_safe() in routing code · d3843fe5
      Wei Wang authored
      With rwlock, it is safe to call dst_hold() in the read thread because
      read thread is guaranteed to be separated from write thread.
      However, after we replace rwlock with rcu, it is no longer safe to use
      dst_hold(). A dst might already have been deleted but is waiting for the
      rcu grace period to pass before freeing the memory when a read thread is
      trying to do dst_hold(). This could potentially cause double free issue.
      
      So this commit replaces all dst_hold() with dst_hold_safe() in all read
      thread to avoid this double free issue.
      And in order to make the code more compact, a new function ip6_hold_safe()
      is introduced. It calls dst_hold_safe() first, and if that fails, it will
      either fall back to hold and return net->ipv6.ip6_null_entry or set rt to
      NULL according to the caller's need.
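
      The pattern behind a "safe" hold is to take a reference only while
      the count is still non-zero. Below is a userspace sketch with C11
      atomics, assuming a plain integer refcount; struct route,
      ref_get_unless_zero and hold_or_fallback are hypothetical names
      that mirror the behaviour described above and are not the kernel
      functions themselves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct route {
    atomic_int refcnt; /* 0 means the route is already being torn down */
};

/* Take a reference only if the count has not dropped to zero. */
static bool ref_get_unless_zero(atomic_int *refcnt)
{
    int old = atomic_load(refcnt);

    while (old != 0) {
        /* On failure, old is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Try to hold rt. If that fails, either hold and return the null entry
 * or return NULL, depending on what the caller asked for.
 */
static struct route *hold_or_fallback(struct route *rt,
                                      struct route *null_entry,
                                      bool null_fallback)
{
    if (ref_get_unless_zero(&rt->refcnt))
        return rt;
    if (null_fallback) {
        atomic_fetch_add(&null_entry->refcnt, 1);
        return null_entry;
    }
    return NULL;
}

/* Usage: a live route is held, a dead one is never revived. */
void hold_example(void)
{
    struct route live, dead, null_entry;

    atomic_init(&live.refcnt, 1);
    atomic_init(&dead.refcnt, 0);
    atomic_init(&null_entry.refcnt, 1);

    assert(hold_or_fallback(&live, &null_entry, true) == &live);
    assert(atomic_load(&live.refcnt) == 2);
    assert(hold_or_fallback(&dead, &null_entry, true) == &null_entry);
    assert(atomic_load(&null_entry.refcnt) == 2);
    assert(hold_or_fallback(&dead, &null_entry, false) == NULL);
    assert(atomic_load(&dead.refcnt) == 0);
}
```

      A plain increment would bring a zero count back to one and lead to a
      second release of the same object, which is the double free the
      message warns about.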
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: don't release rt->rt6i_pcpu memory during rt6_release() · 51e398e8
      Wei Wang authored
      After rwlock is replaced with rcu and spinlock, route lookup can happen
      simultaneously with route deletion.
      This patch removes the call to free_percpu(rt->rt6i_pcpu) from
      rt6_release() to avoid the race condition between rt6_release() and
      rt6_get_pcpu_route(). And as free_percpu(rt->rt6i_pcpu) is already
      called in ip6_dst_destroy() after the rcu grace period, it is safe to do
      this change.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: grab rt->rt6i_ref before allocating pcpu rt · a94b9367
      Wei Wang authored
      After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be
      called with only rcu held. That means rt6 route deletion could happen
      simultaneously with rt6_make_pcpu_rt(). This could potentially cause
      memory leak if rt6_release() is called right before rt6_make_pcpu_rt()
      on the same route.
      
      This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_rt()
      to make sure rt6_release() will not get triggered while
      rt6_make_pcpu_rt() is in progress. And rt6_release() is called after
      rt6_make_pcpu_rt() is finished.
      
      Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a
      very slim chance that fib6_purge_rt() will be triggered unnecessarily
      when deleting a route if ip6_pol_route() running on another thread picks
      this route as well and tries to make pcpu cache for it.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: hook up exception table to store dst cache · 2b760fcf
      Wei Wang authored
      This commit makes use of the exception hash table implementation to
      store dst caches created by pmtu discovery and ip redirect into the hash
      table under the rt_info and no longer inserts these routes into fib6
      tree.
      This makes the fib6 tree only contain static configured routes and could
      now be protected by rcu instead of a rw lock.
      With this change, in the route lookup related functions, after finding
      the rt6_info with the longest prefix, we also need to search for the
      exception table before doing backtracking.
      In the route delete function, if the route being deleted is not a dst
      cache, deletion of this route also needs to flush the whole hash table
      under it. If it is a dst cache, then only delete the cached dst in the
      hash table.
      
      Note: for fib6_walk_continue() function, w->root now is always pointing
      to a root node considering that fib6_prune_clones() is removed from the
      code. So we add a WARN_ON() msg to make sure w->root always points to a
      root node and also removed the update of w->root in fib6_repair_tree().
      This is a prerequisite for later patch because we don't need to make
      w->root as rcu protected when replacing rwlock with RCU.
      Also, we remove all prune related variables as they are no longer used.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prepare fib6_locate() for exception table · 38fbeeee
      Wei Wang authored
      fib6_locate() is used to find the fib6_node according to the passed in
      prefix address key. It currently tries to find the fib6_node with the
      exact match of the passed in key. However, when we move cached routes
      into the exception table, fib6_locate() will fail to find the fib6_node
      for it as the cached routes will be stored in the exception table under
      the fib6_node with the longest prefix match of the cache's dst addr key.
      This commit adds a new parameter to let the caller specify if it needs
      exact match or longest prefix match.
      Right now, all callers still do exact match when calling
      fib6_locate(). It will be changed in later commit where exception table
      is hooked up to store cached routes.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prepare fib6_age() for exception table · c757faa8
      Wei Wang authored
      If all dst cache entries are stored in the exception table under the
      main route, we have to go through them during fib6_age() when doing
      garbage collecting.
      Introduce a new function rt6_age_exception() which goes through all dst
      entries in the exception table and removes those entries that are expired.
      This function is called in fib6_age() so that all dst caches are also
      garbage collected.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prepare rt6_clean_tohost() for exception table · b16cb459
      Wei Wang authored
      If we move all cached dst into the exception table under the main route,
      current rt6_clean_tohost() will no longer be able to access them.
      This commit makes fib6_clean_tohost() also go through all cached
      routes in exception table and removes cached gateway routes to the
      passed in gateway.
      This is a preparation in order to move all cached routes into the
      exception table.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prepare rt6_mtu_change() for exception table · f5bbe7ee
      Wei Wang authored
      If we move all cached dst into the exception table under the main route,
      current rt6_mtu_change() will no longer be able to access them.
      This commit makes the rt6_mtu_change_route() function also go through all
      cached routes in the exception table under the main route and do proper
      updates on the mtu.
      This is a preparation in order to move all cached routes into the
      exception table.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prepare fib6_remove_prefsrc() for exception table · 60006a48
      Wei Wang authored
      After we move cached dst entries into the exception table under its
      parent route, current fib6_remove_prefsrc() no longer can access them.
      This commit makes fib6_remove_prefsrc() also go through all routes
      in the exception table to remove the pref src.
      This is a preparation patch in order to move all cached dst into the
      exception table.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce a hash table to store dst cache · 35732d01
      Wei Wang authored
      Add a hash table into struct rt6_info in order to store dst caches
      created by pmtu discovery and ip redirect in ipv6 routing code.
      APIs to add dst cache, delete dst cache, find dst cache and update
      dst cache in the hash table are implemented and will be used in later
      commits.
      This is a preparation work to move all cache routes into the exception
      table instead of getting inserted into the fib6 tree.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce a new function fib6_update_sernum() · 180ca444
      Wei Wang authored
      This function takes a route as input and tries to update the sernum in
      the fib6_node this route is associated with. It will be used in a later
      commit when adding a cached route into the exception table under that
      route.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      180ca444
    • Jonathan Toppins's avatar
      bnxt_en: don't consider building bnxt_tc.o if option not enabled · 0d7b70e8
      Jonathan Toppins authored
      Instead of zeroing out bnxt_tc.c with a #ifdef foo, don't compile
      the file when the option is not enabled. Now make and the preprocessor do
      not have to waste time compiling a no-op.
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Acked-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d7b70e8
  2. 06 Oct, 2017 10 commits
    • David S. Miller's avatar
      Merge branch 'tcp-rbtree-retransmit-queue' · ca822141
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: implement rb-tree based retransmit queue
      
      This patch series implements an RB-tree based retransmit queue for TCP,
      to better match modern BDP.
      
      Tested:
      
       On receiver :
       netem on ingress : delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      
      for f in `seq 1 10`
      do
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      
      Before patch :
      
      323.87  351.48  339.59  338.62  306.72
      204.07  304.93  291.88  202.47  176.88
      ->   2840
      
      After patch:
      
      1700.83 2207.98 2070.17 1544.26 2114.76
      2124.89 1693.14 1080.91 2216.82 1299.94
      ->  18053
      
      Average of 1805 Mbits instead of 284 Mbits.
      ====================
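      The averages quoted in the cover letter can be reproduced from the ten
      per-run figures. The sketch below is only a cross-check of that
      arithmetic; the function and array names are illustrative and do not
      appear in the series.

```c
#include <stddef.h>

/* Arithmetic mean of n throughput samples (Mbit/s); 0.0 for an empty set. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / n : 0.0;
}

/* Per-run netperf throughput copied from the cover letter. */
static const double before[] = {
	323.87, 351.48, 339.59, 338.62, 306.72,
	204.07, 304.93, 291.88, 202.47, 176.88,
};

static const double after[] = {
	1700.83, 2207.98, 2070.17, 1544.26, 2114.76,
	2124.89, 1693.14, 1080.91, 2216.82, 1299.94,
};
```

      mean(before, 10) is about 284.05 and mean(after, 10) is about 1805.37,
      matching the quoted 284 and 1805 Mbits.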
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca822141
    • Eric Dumazet's avatar
      tcp: implement rb-tree based retransmit queue · 75c119af
      Eric Dumazet authored
      Using a linear list to store all skbs in write queue has been okay
      for quite a while : O(N) is not too bad when N < 500.
      
      Things get messy when N is on the order of 100,000 : Modern TCP stacks
      want 10Gbit+ of throughput even with 200 ms RTT flows.
      
      40 ns per cache line miss means a full scan can use 4 ms,
      blowing away CPU caches.
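
      The figures above can be checked with a back-of-the-envelope
      calculation. The sketch below assumes one cache line miss per skb
      visited and, for the queue-length estimate, a 1500-byte packet size;
      neither assumption is stated in this commit message, and the function
      names are illustrative.

```c
#include <stdint.h>

/* Worst-case time, in nanoseconds, to walk a linear queue of n_skbs
 * entries, assuming one cache line miss per entry (an assumption). */
static uint64_t full_scan_ns(uint64_t n_skbs, uint64_t miss_ns)
{
	return n_skbs * miss_ns;
}

/* Packets in flight needed to fill a link: bandwidth-delay product in
 * bytes divided by an assumed packet size. */
static uint64_t packets_in_flight(uint64_t bits_per_sec, uint64_t rtt_ms,
				  uint64_t pkt_bytes)
{
	uint64_t bdp_bytes = bits_per_sec / 8 * rtt_ms / 1000;

	return bdp_bytes / pkt_bytes;
}
```

      With 100,000 skbs and 40 ns per miss this gives 4,000,000 ns, i.e. the
      4 ms quoted above. A 10 Gbit flow with 200 ms RTT and 1500-byte packets
      works out to 166,666 packets in flight, consistent with N on the order
      of 100,000.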
      
      SACK processing often can use various hints to avoid parsing
      whole retransmit queue. But with high packet losses and/or high
      reordering, hints no longer work.
      
      Sender has to process thousands of unfriendly SACKs, accumulating
      a huge socket backlog, burning a cpu and massively dropping packets.
      
      Using an rb-tree for retransmit queue has been avoided for years
      because it added complexity and overhead, but now is the time
      to be more resistant and say no to quadratic behavior.
      
      1) RTX queue is no longer part of the write queue : already sent skbs
      are stored in one rb-tree.
      
      2) Since reaching the head of write queue no longer needs
      sk->sk_send_head, we added a union of sk_send_head and tcp_rtx_queue.
      
      Tested:
      
       On receiver :
       netem on ingress : delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      
      for f in `seq 1 10`
      do
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      
      Before patch :
      
      323.87
      351.48
      339.59
      338.62
      306.72
      204.07
      304.93
      291.88
      202.47
      176.88
         2840
      
      After patch:
      
      1700.83
      2207.98
      2070.17
      1544.26
      2114.76
      2124.89
      1693.14
      1080.91
      2216.82
      1299.94
        18053
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75c119af
    • Eric Dumazet's avatar
      tcp: pass previous skb to tcp_shifted_skb() · f3319816
      Eric Dumazet authored
      No need to recompute previous skb, as it will be a bit more
      expensive when rtx queue is converted to RB tree.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3319816
    • Eric Dumazet's avatar
      tcp: reduce tcp_fastretrans_alert() verbosity · 8ba6ddaa
      Eric Dumazet authored
      With upcoming rb-tree implementation, the checks will trigger
      more often, and this is expected.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ba6ddaa
    • Eric Dumazet's avatar
      tcp: tcp_mark_head_lost() optimization · 5e76ee4b
      Eric Dumazet authored
      It will be a bit more expensive to get the head of rtx queue
      once rtx queue is converted to an rb-tree.
      
      We can avoid this extra cost in case tp->lost_skb_hint is set.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e76ee4b
    • Eric Dumazet's avatar
      tcp: tcp_tx_timestamp() cleanup · 4e8cc228
      Eric Dumazet authored
      tcp_write_queue_tail() call can be factorized.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e8cc228
    • Eric Dumazet's avatar
      tcp: uninline tcp_write_queue_purge() · ac3f09ba
      Eric Dumazet authored
      Since the upcoming rtx rbtree will add some extra code,
      it is time to not inline this fat function anymore.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac3f09ba
    • Eric Dumazet's avatar
      net: add rb_to_skb() and other rb tree helpers · 18a4c0ea
      Eric Dumazet authored
      Generalize private netem_rb_to_skb()
      
      TCP rtx queue will soon be converted to rb-tree,
      so we will need skb_rbtree_walk() helpers.
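
      Helpers of this kind recover the enclosing buffer from a pointer to the
      tree node embedded in it. The userspace sketch below illustrates that
      pattern only; the toy_ types and toy_rb_to_skb() are hypothetical
      stand-ins, not the kernel definitions added by this commit.

```c
#include <stddef.h>

/* Stand-ins for the kernel types; names and fields are illustrative. */
struct toy_rb_node {
	struct toy_rb_node *left, *right;
};

struct toy_skb {
	unsigned int seq;
	struct toy_rb_node rbnode;	/* tree linkage embedded in the buffer */
};

/* Recover the enclosing toy_skb from a pointer to its embedded node by
 * subtracting the member offset, the same pointer arithmetic that a
 * container_of()-style macro performs. A NULL node maps to NULL. */
static struct toy_skb *toy_rb_to_skb(struct toy_rb_node *rb)
{
	if (!rb)
		return NULL;
	return (struct toy_skb *)((char *)rb - offsetof(struct toy_skb, rbnode));
}
```

      Embedding the node in the buffer means a tree walk needs no separate
      allocation per entry: each node pointer visited converts directly back
      to its buffer.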
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18a4c0ea
    • David S. Miller's avatar
      Merge branch '40GbE' of ra.kernel.org:/pub/scm/linux/kernel/git/jkirsher/next-queue · f5333f80
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2017-10-06
      
      This series contains updates to i40e and i40evf only.
      
      Rami fixes a typo in the code comments.
      
      Mitch adds an ethtool private flag to control source pruning to resolve an
      issue where our default behavior is to enable source pruning which breaks ARP
      monitoring in channel bonding.  Also fixes a couple of register definitions
      which were incorrect.
      
      Jake fixes an issue with multiple logical CPUs per core (simultaneous
      multithreading - SMT) and how we set an affinity hint based on the v_idx of
      that q_vector, which is an incremental value and might lead to multiple
      offline CPUs being assigned to a q_vector.  Instead, we should only assign
      hints for CPUs which are online, so look to use cpumask_local_spread().
      Also fixed a VF VLAN tag stripping issue, where the flag created to change
      this feature was seen as unchangeable.  Lastly, organized and re-numbered
      the feature flags.
      
      Alan re-enables PTP L4 for XL710 devices with firmware version 6.0 or
      greater, now that the previous bug in the older firmware is fixed.
      Implements the PCI error handlers for reset_prepare() and reset_done() to
      allow us to handle function level resets.
      
      Alice cleans up code that was added to the incorrect function during a
      merge.
      
      Filip adds a change to display an error message when a module is inserted
      that does not meet the thermal requirements, Talking Heads "Burning Down
      the House" comes to mind.  Also fixed a flow director filter issue where
      a variable was not being cleared which stores the filter number to be
      removed from the list when the firmware refused to add the requested
      filter.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f5333f80
    • David S. Miller's avatar
      Merge tag 'batadv-next-for-davem-20171006' of git://git.open-mesh.org/linux-merge · 4bc4e64c
      David S. Miller authored
      Simon Wunderlich says:
      
      ====================
      This cleanup patchset includes the following patches:
      
       - bump version strings, by Simon Wunderlich
      
       - Cleanup patches to make checkpatch happy, by Sven Eckelmann (3 patches)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bc4e64c