1. 09 Mar, 2016 12 commits
  2. 08 Mar, 2016 28 commits
    • David S. Miller's avatar
      Merge branch 'bpf-map-prealloc' · f14b488d
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: map pre-alloc
      
      v1->v2:
      . fix few issues spotted by Daniel
      . converted stackmap into pre-allocation as well
      . added a workaround for lockdep false positive
      . added pcpu_freelist_populate to be used by hashmap and stackmap
      
      this path set switches bpf hash map to use pre-allocation by default
      and introduces BPF_F_NO_PREALLOC flag to keep old behavior for cases
      where full map pre-allocation is too memory expensive.
      
      Some time back Daniel Wagner reported crashes when bpf hash map is
      used to compute time intervals between preempt_disable->preempt_enable
      and recently Tom Zanussi reported a dead lock in iovisor/bcc/funccount
      tool if it's used to count the number of invocations of kernel
      '*spin*' functions. Both problems are due to the recursive use of
      slub and can only be solved by pre-allocating all map elements.
      
      A lot of different solutions were considered. Many implemented,
      but at the end pre-allocation seems to be the only feasible answer.
      As far as pre-allocation goes it also was implemented 4 different ways:
      - simple free-list with single lock
      - percpu_ida with optimizations
      - blk-mq-tag variant customized for bpf use case
      - percpu_freelist
      For bpf style of alloc/free patterns percpu_freelist is the best
      and implemented in this patch set.
      Detailed performance numbers in patch 3.
      Patch 2 introduces percpu_freelist
      Patch 1 fixes simple deadlocks due to missing recursion checks
      Patch 5: converts stackmap to pre-allocation
      Patches 6-9: prepare test infra
      Patch 10: stress test for hash map infra. It attaches to spin_lock
      functions and bpf_map_update/delete are called from different contexts
      Patch 11: stress for bpf_get_stackid
      Patch 12: map performance test
      Reported-by: default avatarDaniel Wagner <daniel.wagner@bmw-carit.de>
      Reported-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f14b488d
    • Alexei Starovoitov's avatar
      samples/bpf: test both pre-alloc and normal maps · c3f85cff
      Alexei Starovoitov authored
      extend test coveraged to include pre-allocated and run-time alloc maps
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c3f85cff
    • Alexei Starovoitov's avatar
      samples/bpf: add map_flags to bpf loader · 89b97607
      Alexei Starovoitov authored
      note old loader is compatible with new kernel.
      map_flags are optional
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89b97607
    • Alexei Starovoitov's avatar
      samples/bpf: move ksym_search() into library · 3622e7e4
      Alexei Starovoitov authored
      move ksym search from offwaketime into library to be reused
      in other tests
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3622e7e4
    • Alexei Starovoitov's avatar
      samples/bpf: make map creation more verbose · 618ec9a7
      Alexei Starovoitov authored
      map creation is typically the first one to fail when rlimits are
      too low, not enough memory, etc
      Make this failure scenario more verbose
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      618ec9a7
    • Alexei Starovoitov's avatar
      bpf: convert stackmap to pre-allocation · 557c0c6e
      Alexei Starovoitov authored
      It was observed that calling bpf_get_stackid() from a kprobe inside
      slub or from spin_unlock causes similar deadlock as with hashmap,
      therefore convert stackmap to use pre-allocated memory.
      
      The call_rcu is no longer feasible mechanism, since delayed freeing
      causes bpf_get_stackid() to fail unpredictably when number of actual
      stacks is significantly less than user requested max_entries.
      Since elements are no longer freed into slub, we can push elements into
      freelist immediately and let them be recycled.
      However the very unlikley race between user space map_lookup() and
      program-side recycling is possible:
           cpu0                          cpu1
           ----                          ----
      user does lookup(stackidX)
      starts copying ips into buffer
                                         delete(stackidX)
                                         calls bpf_get_stackid()
      				   which recyles the element and
                                         overwrites with new stack trace
      
      To avoid user space seeing a partial stack trace consisting of two
      merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
      to preserve consistent stack trace delivery to user space.
      Now we can move memset(,0) of left-over element value from critical
      path of bpf_get_stackid() into slow-path of user space lookup.
      Also disallow lookup() from bpf program, since it's useless and
      program shouldn't be messing with collected stack trace.
      
      Note that similar race between user space lookup and kernel side updates
      is also present in hashmap, but it's not a new race. bpf programs were
      always allowed to modify hash and array map elements while user space
      is copying them.
      
      Fixes: d5a3b1f6 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      557c0c6e
    • Alexei Starovoitov's avatar
    • Alexei Starovoitov's avatar
      bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov authored
      If kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following dead lock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
      and deadlocks.
      
      The following solutions were considered and some implemented, but
      eventually discarded
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      At the end pre-allocation of all map elements turned out to be the simplest
      solution and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user space visible behavior.
      
      Since it's impossible to tell whether kprobe is triggered in a safe
      location from kmalloc point of view, use pre-allocation by default
      and introduce new BPF_F_NO_PREALLOC flag.
      
      While testing of per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      Turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is very common pattern used
      in many of iovisor/bcc/tools, so there is additional benefit of
      pre-allocation, since such use cases are must faster.
      
      Since all hash map elements are now pre-allocated we can remove
      atomic increment of htab->count and save few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simlified and customized
      for bpf use case. bt w/smp_align is using cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks continuously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist
      
      hlist+spinlock is the simplest free list with single spinlock.
      As expeceted it has very bad scaling in SMP.
      
      kmalloc is existing implementation which is still available via
      BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
      in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but saves memory, so in cases where map->max_entries can be large
      and number of map update/delete per second is low, it may make
      sense to use it.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6c905981
    • Alexei Starovoitov's avatar
      bpf: introduce percpu_freelist · e19494ed
      Alexei Starovoitov authored
      Introduce simple percpu_freelist to keep single list of elements
      spread across per-cpu singly linked lists.
      
      /* push element into the list */
      void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
      
      /* pop element from the list */
      struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);
      
      The object is pushed to the current cpu list.
      Pop first trying to get the object from the current cpu list,
      if it's empty goes to the neigbour cpu list.
      
      For bpf program usage pattern the collision rate is very low,
      since programs push and pop the objects typically on the same cpu.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e19494ed
    • Alexei Starovoitov's avatar
      bpf: prevent kprobe+bpf deadlocks · b121d1e7
      Alexei Starovoitov authored
      if kprobe is placed within update or delete hash map helpers
      that hold bucket spin lock and triggered bpf program is trying to
      grab the spinlock for the same bucket on the same cpu, it will
      deadlock.
      Fix it by extending existing recursion prevention mechanism.
      
      Note, map_lookup and other tracing helpers don't have this problem,
      since they don't hold any locks and don't modify global data.
      bpf_trace_printk has its own recursive check and ok as well.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b121d1e7
    • David S. Miller's avatar
      Merge branch 'ipv6-per-netns-gc' · 8aba8b83
      David S. Miller authored
      Michal Kubecek says:
      
      ====================
      ipv6: per netns FIB6 walkers and garbage collector
      
      Commit 2ac3ac8f ("ipv6: prevent fib6_run_gc() contention") reduced
      the risk of contention on FIB6 garbage collector lock on systems with
      many CPUs. However, one of our customers can still observe heavy
      contention on fib6_gc_lock which can even trigger the soft lockup
      detector.
      
      This is caused by garbage collector running in forced mode from a timer.
      While there is one timer per network namespace, the instances of
      fib6_run_gc() running from them are protected by one global spinlock so
      that only one garbage collector can run at any moment and other
      namespaces have to wait. As most relevant data structures are separated
      per netns, there is little reason for garbage collectors blocking each
      other.
      
      Similar problem exists for walkers: changes in one tree do not need to
      adjust (and block) walkers traversing FIB trees in other namespaces.
      
      This series separates both the walkers infrastructure and garbage
      collector so that they work independently in network namespaces.
      
      v2: get rid of ifdef in ipv6_route_seq_setup_walk(), pass net from
      callers instead
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8aba8b83
    • Michal Kubeček's avatar
      ipv6: per netns FIB garbage collection · 3dc94f93
      Michal Kubeček authored
      One of our customers observed issues with FIB6 garbage collectors
      running in different network namespaces blocking each other, resulting
      in soft lockups (fib6_run_gc() initiated from timer runs always in
      forced mode).
      
      Now that FIB6 walkers are separated per namespace, there is no more need
      for instances of fib6_run_gc() in different namespaces blocking each
      other. There is still a call to icmp6_dst_gc() which operates on shared
      data but this function is protected by its own shared lock.
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3dc94f93
    • Michal Kubeček's avatar
      ipv6: per netns fib6 walkers · 9a03cd8f
      Michal Kubeček authored
      The IPv6 FIB data structures are separated per network namespace but
      there is still only one global walkers list and one global walker list
      lock. This means changes in one namespace unnecessarily interfere with
      walkers in other namespaces.
      
      Replace the global list with per-netns lists (and give each its own
      lock).
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9a03cd8f
    • Michal Kubeček's avatar
      ipv6: replace global gc_args with local variable · 3570df91
      Michal Kubeček authored
      Global variable gc_args is only used in fib6_run_gc() and functions
      called from it. As fib6_run_gc() makes sure there is at most one
      instance of fib6_clean_all() running at any moment, we can replace
      gc_args with a local variable which will be needed once multiple
      instances (per netns) of garbage collector are allowed.
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3570df91
    • David S. Miller's avatar
      Merge branch 'bnxt_en-next' · 02daec7c
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: Updates for net-next.
      
      Updates to support autoneg for all supported speeds, add PF port statistics,
      and Advanced Error Reporting.
      
      v2: Fixed patch 3 to not use parentheses on function return.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02daec7c
    • Satish Baddipadige's avatar
      bnxt_en: Enable AER support. · 6316ea6d
      Satish Baddipadige authored
      Add pci_error_handler callbacks to support for pcie advanced error
      recovery.
      Signed-off-by: default avatarSatish Baddipadige <sbaddipa@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6316ea6d
    • Michael Chan's avatar
      bnxt_en: Include hardware port statistics in ethtool -S. · 8ddc9aaa
      Michael Chan authored
      Include the more useful port statistics in ethtool -S for the PF device.
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8ddc9aaa
    • Michael Chan's avatar
      bnxt_en: Include some hardware port statistics in ndo_get_stats64(). · 9947f83f
      Michael Chan authored
      Include some of the port error counters (e.g. crc) in ->ndo_get_stats64()
      for the PF device.
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9947f83f
    • Michael Chan's avatar
      bnxt_en: Add port statistics support. · 3bdf56c4
      Michael Chan authored
      Gather periodic port statistics if the device is PF and link is up.  This
      is triggered in bnxt_timer() every one second to request firmware to DMA
      the counters.
      Signed-off-by: default avatarMichael Chan <michael.chan@broadocm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3bdf56c4
    • Michael Chan's avatar
      bnxt_en: Extend autoneg to all speeds. · f1a082a6
      Michael Chan authored
      Allow all autoneg speeds aupported by firmware to be advertised.  If
      the advertising parameter is 0, then all supported speeds will be
      advertised.
      
      Remove BNXT_ALL_COPPER_ETHTOOL_SPEED which is no longer used as all
      supported speeds can be advertised.
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f1a082a6
    • Michael Chan's avatar
      bnxt_en: Use common function to get ethtool supported flags. · 4b32cacc
      Michael Chan authored
      The supported bits and advertising bits in ethtool have the same
      definitions.  The same is true for the firmware bits.  So use the
      common function to handle the conversion for both supported and
      advertising bits.
      
      v2: Don't use parentheses on function return.
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4b32cacc
    • Michael Chan's avatar
      bnxt_en: Add reporting of link partner advertisement. · 3277360e
      Michael Chan authored
      And report actual pause settings to ETHTOOL_GPAUSEPARAM to let ethtool
      resolve the actual pause settings.
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3277360e
    • Michael Chan's avatar
      bnxt_en: Refactor bnxt_fw_to_ethtool_advertised_spds(). · 27c4d578
      Michael Chan authored
      Include the conversion of pause bits and add one extra call layer so
      that the same refactored function can be reused to get the link partner
      advertisement bits.
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      27c4d578
    • Kyeong Yoo's avatar
      net_sched: dsmark: use qdisc_dequeue_peeked() · f8b33d8e
      Kyeong Yoo authored
      This fix is for dsmark similar to commit 3557619f
      ("net_sched: prio: use qdisc_dequeue_peeked")
      and makes use of qdisc_dequeue_peeked() instead of direct dequeue() call.
      
      First time, wrr peeks dsmark, which will then peek into sfq.
      sfq dequeues an skb and it's stored in sch->gso_skb.
      Next time, wrr tries to dequeue from dsmark, which will call sfq dequeue
      directly. This results skipping the previously peeked skb.
      
      So changed dsmark dequeue to call qdisc_dequeue_peeked() instead to use
      peeked skb if exists.
      Signed-off-by: default avatarKyeong Yoo <kyeong.yoo@alliedtelesis.co.nz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8b33d8e
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 4c38cd61
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      The following patchset contains Netfilter updates for your net-next tree,
      they are:
      
      1) Remove useless debug message when deleting IPVS service, from
         Yannick Brosseau.
      
      2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
         several spots of the IPVS code, from Arnd Bergmann.
      
      3) Add prandom_u32 support to nft_meta, from Florian Westphal.
      
      4) Remove unused variable in xt_osf, from Sudip Mukherjee.
      
      5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
         since fixing af_packet defragmentation issues, from Joe Stringer.
      
      6) On-demand hook registration for iptables from netns. Instead of
         registering the hooks for every available netns whenever we need
         one of the support tables, we register this on the specific netns
         that needs it, patchset from Florian Westphal.
      
      7) Add missing port range selection to nf_tables masquerading support.
      
      BTW, just for the record, there is a typo in the description of
      5f6c253e ("netfilter: bridge: register hooks only when bridge
      interface is added") that refers to the cluster match as deprecated, but
      it is actually the CLUSTERIP target (which registers hooks
      inconditionally) the one that is scheduled for removal.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4c38cd61
    • David S. Miller's avatar
      Merge branch 'bpf-next' · d24ad3fc
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      Couple of misc updates to BPF, besides others this series adds
      bpf_csum_diff() to be used with L3 csums, allows for managing
      tunnel options for collect meta data mode, and enabling ipv6
      traffic class for collect meta data in vxlan specifically (geneve
      already supports it). For more details, please see individual
      patches.
      
      The series requires net to be merged into net-next first to
      avoid any further pending merge conflicts.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d24ad3fc
    • Daniel Borkmann's avatar
      vxlan: allow setting ipv6 traffic class · 1400615d
      Daniel Borkmann authored
      We can already do that for IPv4, but IPv6 support was missing. Add
      it for vxlan, so it can be used with collect metadata frontends.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1400615d
    • Daniel Borkmann's avatar
      bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit · db3c6139
      Daniel Borkmann authored
      The assumptions from commit 0c1d70af ("net: use dst_cache for vxlan
      device"), 468dfffc ("geneve: add dst caching support") and 3c1cb4d2
      ("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
      when ip_tunnel_info is used is unfortunately not always valid as assumed.
      
      While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
      however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
      with different remote dsts, tos, etc, therefore they cannot be assumed as
      packet independent.
      
      Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
      would reuse the same entry that was first created on the initial route
      lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
      a different one.
      
      Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
      in vxlan needs to be handeled differently in this context as it is currently
      inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
      helper is added for the three tunnel cases, which checks if we can use dst
      cache.
      
      Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
      Fixes: 468dfffc ("geneve: add dst caching support")
      Fixes: 3c1cb4d2 ("net/ipv4: add dst cache support for gre lwtunnels")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      db3c6139