1. 17 Mar, 2017 19 commits
    • samples/bpf: add map_lookup microbenchmark · 95ff141e
      Alexei Starovoitov authored
      $ map_perf_test 128
      speed of HASH bpf_map_lookup_elem() in lookups per second
      	w/o JIT		w/JIT
      before	46M		58M
      after	42M		74M
      
      perf report
      before:
          54.23%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
          14.24%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
           8.84%  map_perf_test  [kernel.kallsyms]  [k] htab_map_lookup_elem
           5.93%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
           2.30%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           1.49%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      after:
          60.03%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
          18.07%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
           2.91%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           1.94%  map_perf_test  [kernel.kallsyms]  [k] _einittext
           1.90%  map_perf_test  [kernel.kallsyms]  [k] __audit_syscall_exit
           1.72%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      Notice that bpf_map_lookup_elem() and htab_map_lookup_elem() are trivial
      functions, yet they take a sizeable amount of CPU time.
      htab_map_gen_lookup() removes bpf_map_lookup_elem() and converts
      htab_map_lookup_elem() into three BPF insns, which causes the CPU time
      for bpf_prog_da4fc6a3f41761a2() to increase slightly.
      
      $ map_perf_test 256
      speed of ARRAY bpf_map_lookup_elem() in lookups per second
      	w/o JIT		w/JIT
      before	97M		174M
      after	64M		280M
      
      before:
          37.33%  map_perf_test  [kernel.kallsyms]  [k] array_map_lookup_elem
          13.95%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
           6.54%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           4.57%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      after:
          32.86%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           6.54%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      array_map_gen_lookup() removes calls to array_map_lookup_elem()
      and bpf_map_lookup_elem() and replaces them with 7 bpf insns.
      
      Without JIT the performance is lower, since executing extra insns
      in the interpreter is slower than running native C code,
      but with JIT the performance gains are obvious,
      since native C->x86 code is replaced with fewer bpf->x86 instructions.
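
To put the two tables in relative terms, a small helper can turn each before/after pair into a percentage. This is only a worked-arithmetic sketch; pct_change() is a name made up here and is not part of map_perf_test.

```c
#include <assert.h>

/* Relative change from 'before' to 'after' in percent,
 * rounded to the nearest integer. */
int pct_change(int before, int after)
{
    assert(before > 0);
    double d = 100.0 * (after - before) / before;
    return (int)(d < 0 ? d - 0.5 : d + 0.5);
}
```

With the numbers above (in millions of lookups per second), HASH goes from 58 to 74 with JIT (about +28%) and from 46 to 42 without (about -9%); ARRAY goes from 174 to 280 with JIT (about +61%) and from 97 to 64 without (about -34%).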
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: inline htab_map_lookup_elem() · 9015d2f5
      Alexei Starovoitov authored
      Optimize:
      bpf_call
        bpf_map_lookup_elem
          map->ops->map_lookup_elem
            htab_map_lookup_elem
              __htab_map_lookup_elem
      into:
      bpf_call
        __htab_map_lookup_elem
      
      to improve performance of JITed programs.
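
The shape of the two call chains can be illustrated in plain userspace C. This is a toy model, not the kernel code: every toy_* name is invented for the illustration, and the "map" holds a single key/value pair.

```c
#include <assert.h>
#include <stddef.h>

struct toy_map;

struct toy_map_ops {
    void *(*map_lookup_elem)(struct toy_map *m, const void *key);
};

struct toy_map {
    const struct toy_map_ops *ops;
    int key;
    int value;
};

/* Innermost worker: plays the role of __htab_map_lookup_elem. */
void *toy_raw_lookup(struct toy_map *m, const void *key)
{
    return *(const int *)key == m->key ? &m->value : NULL;
}

/* Thin wrapper reached through the ops table. */
void *toy_wrapper_lookup(struct toy_map *m, const void *key)
{
    return toy_raw_lookup(m, key);
}

/* Generic helper: one indirect call plus the wrapper before
 * the worker is reached. */
void *toy_generic_lookup(struct toy_map *m, const void *key)
{
    assert(m != NULL && m->ops != NULL);
    return m->ops->map_lookup_elem(m, key);
}
```

Both paths return the same pointer; calling toy_raw_lookup() directly just skips the two intermediate frames, which is the effect the optimization is after.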
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add helper inlining infra and optimize map_array lookup · 81ed18ab
      Alexei Starovoitov authored
      Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem()
      into a sequence of bpf instructions.
      When JIT is on, the sequence of bpf instructions becomes a sequence
      of native CPU instructions, which is significantly faster
      than an indirect call plus two functions' prologues/epilogues.
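
What such an inlined sequence has to compute for an array map is small: a bounds check followed by base + index * element size. A minimal userspace sketch of that computation, assuming a flat value area; struct toy_array and its field names are hypothetical and not the kernel's layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_array {
    uint32_t max_entries;
    uint32_t elem_size;
    unsigned char *value;   /* max_entries * elem_size bytes */
};

/* Bounds check, then pointer arithmetic: no calls involved. */
void *toy_array_lookup(const struct toy_array *a, uint32_t index)
{
    assert(a != NULL);
    if (index >= a->max_entries)
        return NULL;
    return a->value + (size_t)index * a->elem_size;
}
```

Because the whole lookup is a compare, a multiply and an add, expressing it as a handful of instructions at the call site removes all call overhead.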
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: adjust insn_aux_data when patching insns · 8041902d
      Alexei Starovoitov authored
      convert_ctx_accesses() replaces a single bpf instruction with a set of
      instructions. Adjust the corresponding insn_aux_data while patching.
      This is needed to make sure subsequent 'for(all insn)' loops
      have matching insn and insn_aux_data.
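
The invariant is that two parallel arrays stay index-aligned when one entry grows into several. A toy model with plain ints standing in for instructions and aux data; toy_prog and toy_patch() are invented names, and which slot keeps the original aux entry is a choice of this sketch, not a statement about the verifier:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_INSNS 16

struct toy_prog {
    int insn[TOY_MAX_INSNS];  /* stand-in for instructions */
    int aux[TOY_MAX_INSNS];   /* stand-in for per-insn aux data */
    int len;
};

/* Replace the instruction at 'off' with 'cnt' new ones and shift
 * the tail of BOTH arrays by cnt - 1 so indices keep matching. */
int toy_patch(struct toy_prog *p, int off, const int *patch, int cnt)
{
    assert(p != NULL && patch != NULL);
    if (off < 0 || off >= p->len || cnt < 1 ||
        p->len + cnt - 1 > TOY_MAX_INSNS)
        return -1;

    size_t tail = (size_t)(p->len - off - 1);
    memmove(&p->insn[off + cnt], &p->insn[off + 1], tail * sizeof(int));
    memmove(&p->aux[off + cnt], &p->aux[off + 1], tail * sizeof(int));
    memcpy(&p->insn[off], patch, (size_t)cnt * sizeof(int));
    /* aux[off] is kept; the newly created slots start out zeroed. */
    memset(&p->aux[off + 1], 0, (size_t)(cnt - 1) * sizeof(int));
    p->len += cnt - 1;
    return 0;
}
```

If only insn[] were shifted, every aux entry after the patch point would describe the wrong instruction in later passes.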
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: refactor fixup_bpf_calls() · 79741b3b
      Alexei Starovoitov authored
      Reduce indent and make it iterate over instructions similarly to
      convert_ctx_accesses(). Also convert the hard BUG_ON into a soft verifier error.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: move fixup_bpf_calls() function · e245c5c6
      Alexei Starovoitov authored
      No functional change.
      Move fixup_bpf_calls() to verifier.c;
      it is refactored in the next patch.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove tcp_tw_recycle · 4396e461
      Soheil Hassas Yeganeh authored
      tcp_tw_recycle was already broken for connections
      behind NAT, since the per-destination timestamp is not
      monotonically increasing for multiple machines behind
      a single destination address.
      
      After the randomization of TCP timestamp offsets
      in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
      for each connection), tcp_tw_recycle is broken for all
      types of connections for the same reason: the timestamps
      received from a single machine are no longer monotonically
      increasing.
      
      Remove tcp_tw_recycle, since it is not functional. Also, remove
      the PAWSPassive SNMP counter since it is only used for
      tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
      since the strict argument is only set when tcp_tw_recycle is
      enabled.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove per-destination timestamp cache · d82bae12
      Soheil Hassas Yeganeh authored
      Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
      randomizes TCP timestamps per connection. After this commit,
      there is no guarantee that the timestamps received from the
      same destination are monotonically increasing. As a result,
      the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
      in struct tcp_metrics_block) is broken and cannot be relied upon.
      
      Remove the per-destination timestamp cache and all related code
      paths.
      
      Note that this cache was already broken for caching timestamps of
      multiple machines behind a NAT sharing the same address.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sunvnet-better-connection-management' · 8b705f52
      David S. Miller authored
      Shannon Nelson says:
      
      ====================
      sunvnet: better connection management
      
      These patches remove some problems in handling of carrier state
      with the ldmvsw vswitch, remove an xoff misuse in sunvnet, and
      add stats for debug and tracking of point-to-point connections
      between the ldom VMs.
      
      v2:
       - added ldmvsw ndo_open to reset the LDC channel
       - updated copyrights
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sunvnet: xoff not needed when removing port link · 9c5a3a1f
      Shannon Nelson authored
      The sunvnet netdev is connected to the controlling ldom's vswitch
      for network bridging.  However, for higher performance between ldoms,
      there also is a channel between each client ldom.  These connections are
      represented in the sunvnet driver by a queue for each ldom.  The driver
      uses select_queue to tell the stack which queue to use by tracking the mac
      addresses on the other end of each port.  When a connected ldom shuts down,
      the driver receives an LDC_EVENT_RESET and the port is removed from the
      driver, thus a queue with no ldom on the other end will never be selected
      for Tx.
      
      The driver was trying to reinforce the "don't use this queue" notion with
      netif_tx_stop_queue() and netif_tx_wake_queue(), which really should only
      be used to signal a Tx queue is full (aka XOFF).  This misuse of queue
      state resulted in NETDEV WATCHDOG messages and lots of unnecessary calls
      into the driver's tx_timeout handler.  Simply removing these takes care
      of the problem.
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sunvnet: count multicast packets · b12a96f5
      Shannon Nelson authored
      Make sure multicast packets get counted in the device.
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sunvnet: track port queues correctly · e1f1e5f7
      Shannon Nelson authored
      Track our used and unused queue indices correctly.  Otherwise, as ports
      dropped out and returned, they all eventually ended up with the same
      queue index.
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sunvnet: add stats to track ldom to ldom packets and bytes · 0f512c84
      Shannon Nelson authored
      In this driver, there is a "port" created for the connection to each of
      the other ldoms; a netdev queue is mapped to each port, and they are
      collected under a single netdev.  The generic netdev statistics show
      us all the traffic in and out of our network device, but don't show
      individual queue/port stats.  This patch breaks out the traffic counts
      for the individual ports and gives us a little view into the state of
      those connections.
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ldmvsw: better use of link up and down on ldom vswitch · 867fa150
      Shannon Nelson authored
      When an ldom VM is bound, the network vswitch infrastructure is set up for
      it, but was being forced 'UP' by the userland switch configuration script.
      When 'UP' but not actually connected to a running VM, the ipv6 neighbor
      probes fail (not a horrible thing) and start cluttering up the kernel logs.
      Funny thing: these are debug messages that never actually show up, but
      we do see the net_ratelimited messages that say N callbacks were
      suppressed.
      
      This patch defers the netif_carrier_on() until an actual link has been
      established with the VM, as indicated by receiving an LDC_EVENT_UP from
      the underlying LDC protocol.  Similarly, we take the link down when we
      see the LDC_EVENT_RESET.  Now when we see the ndo_open(), we reset the
      link to get things talking again.
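
The resulting behaviour is a small state machine: open only asks for a channel reset, and the carrier follows the channel events. A userspace model of that flow, with invented toy_* names standing in for the driver's callbacks and the LDC events:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_ldc_event { TOY_LDC_EVENT_UP, TOY_LDC_EVENT_RESET };

struct toy_vsw_port {
    bool carrier;        /* link state as the stack would see it */
    int reset_requests;  /* channel resets requested by open */
};

/* Open does not force the link up; it only requests a reset. */
void toy_open(struct toy_vsw_port *p)
{
    assert(p != NULL);
    p->reset_requests++;
}

/* Carrier goes on with UP and off with RESET. */
void toy_ldc_event(struct toy_vsw_port *p, enum toy_ldc_event ev)
{
    assert(p != NULL);
    p->carrier = (ev == TOY_LDC_EVENT_UP);
}
```

A bound but not yet running VM therefore leaves the port without carrier, so nothing probes a peer that cannot answer.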
      
      Orabug: 25525312
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: add 802.3ad support for 25G speeds · 19ddde1e
      Jarod Wilson authored
      Cut-n-paste enablement of 802.3ad bonding on 25G NICs, which currently
      report 0 as their bandwidth.
      
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_westwood: fix tcp_westwood_info() style mistakes · be7164cd
      chun Long authored
      Replace commas with semicolons in tcp_westwood_info().
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • liquidio: use meaningful names for IRQs · 0c88a761
      Rick Farrington authored
      All IRQs owned by the PF and VF drivers share the same nondescript name
      "octeon"; this makes it difficult to set up interrupt affinity.
      
      Change the IRQ names to reflect their specific purpose:
      
          LiquidIO<id>-<func>-<type>-<queue pair num>
      
      Examples:
          LiquidIO0-pf0-rxtx-3
          LiquidIO1-vf1-rxtx-0
          LiquidIO0-pf0-aux
      
      We cannot use netdev->name for naming the IRQs because:
      
          1.  Early during init, the PF and VF drivers require interrupts to
              send/receive control data from the NIC firmware; so the PF and VF
              must request IRQs long before the netdev struct is registered.
      
          2.  The IRQ name can only be specified at the time it is requested.
              It cannot be changed after that.
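
Building names in the format shown above is a matter of one snprintf() per IRQ. The helper below is a sketch: toy_irq_name() and its parameters are hypothetical, and passing a negative queue to select the "aux" form is a convention of this example only.

```c
#include <assert.h>
#include <stdio.h>

/* Writes "LiquidIO<id>-<func><n>-rxtx-<queue>" or, for queue < 0,
 * "LiquidIO<id>-<func><n>-aux" into buf. Returns what snprintf()
 * returns: the length the full name needs, excluding the NUL. */
int toy_irq_name(char *buf, size_t len, int id,
                 const char *func, int func_num, int queue)
{
    assert(buf != NULL && func != NULL);
    if (queue < 0)
        return snprintf(buf, len, "LiquidIO%d-%s%d-aux",
                        id, func, func_num);
    return snprintf(buf, len, "LiquidIO%d-%s%d-rxtx-%d",
                    id, func, func_num, queue);
}
```

Since the name has to exist when the IRQ is requested, the buffer would need to be filled before that point and outlive the request.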
      Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
      Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
      Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • liquidio: remove/replace invalid code · b229487b
      Rick Farrington authored
      Remove invalid call to dma_sync_single_for_cpu() because previous DMA
      allocation was coherent--not streaming.  Remove code that references fields
      in struct list_head; replace it with calls to list_empty() and
      list_first_entry().  Also, add comment to clarify complicated if statement.
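
The difference between poking at list fields and using accessors can be shown with a minimal userspace re-creation of the list idiom. The toy_* helpers below only mimic the style of the kernel's list API; they are written for this example and struct toy_buf is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_list_head { struct toy_list_head *next, *prev; };

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define toy_list_first_entry(head, type, member) \
    toy_container_of((head)->next, type, member)

void toy_list_init(struct toy_list_head *h) { h->next = h->prev = h; }
int toy_list_empty(const struct toy_list_head *h) { return h->next == h; }

void toy_list_add_tail(struct toy_list_head *n, struct toy_list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

struct toy_buf { int id; struct toy_list_head node; };

/* Callers never touch ->next or ->prev themselves. */
struct toy_buf *toy_first_buf(struct toy_list_head *head)
{
    assert(head != NULL);
    if (toy_list_empty(head))
        return NULL;
    return toy_list_first_entry(head, struct toy_buf, node);
}
```

Keeping the field accesses inside the helpers means the calling code stays valid even if the list representation changes.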
      Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
      Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
      Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netem: apply correct delay when rate throttling · 5080f39e
      Nik Unger authored
      I recently reported on the netem list that iperf network benchmarks
      show unexpected results when a bandwidth throttling rate has been
      configured for netem. Specifically:
      
      1) The measured link bandwidth *increases* when a higher delay is added
      2) The measured link bandwidth appears higher than the specified limit
      3) The measured link bandwidth for the same very slow settings varies
        significantly across machines
      
      The issue can be reproduced by using tc to configure netem with a
      512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
      veth pair between network namespaces, and then using iperf (or any
      other network benchmarking tool) to test throughput. Complete detailed
      instructions are in the original email chain here:
      https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
      
      There appear to be two underlying bugs causing these effects:
      
      - The first issue causes long delays when the rate is slow and no
        delay is configured (e.g., "rate 512kbit"). This is because SKBs are
        not orphaned when no delay is configured, so orphaning does not
        occur until *after* the rate-induced delay has been applied. For
        this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
        dramatically increases the measured bandwidth.
      
      - The second issue is that rate-induced delays are not correctly
        applied, allowing SKB delays to occur in parallel. The intended
        approach is to compute the delay for an SKB and to add this delay to
        the end of the current queue. However, the code does not detect
        existing SKBs in the queue due to improperly testing sch->q.qlen,
        which is nonzero even when packets exist only in the
        rbtree. Consequently, new SKBs do not wait for the current queue to
        empty. When packet delays vary significantly (e.g., if packet sizes
        are different), then this also causes unintended reordering.
      
      I modified the code to expect a delay (and orphan the SKB) when a rate
      is configured. I also added some defensive tests that correctly find
      the latest scheduled delivery time, even if it is (unexpectedly) for a
      packet in sch->q. I have tested these changes on the latest kernel
      (4.11.0-rc1+) and the iperf / ping test results are as expected.
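
The intended serialization can be written down in a few lines: each packet's transmission time is appended to the end of whatever is already scheduled. This is a model of the idea only, not the netem code; the toy_* functions are hypothetical and "512kbit" is taken as 512000 bits per second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nanoseconds needed to send len_bytes at rate_bps bits per second. */
uint64_t toy_tx_time_ns(uint32_t len_bytes, uint64_t rate_bps)
{
    assert(rate_bps > 0);
    return (uint64_t)len_bytes * 8u * 1000000000ull / rate_bps;
}

/* A packet starts only once the previously scheduled one is done.
 * Updates *last_done_ns and returns this packet's delivery time. */
uint64_t toy_schedule(uint64_t now_ns, uint64_t *last_done_ns,
                      uint32_t len_bytes, uint64_t rate_bps)
{
    assert(last_done_ns != NULL);
    uint64_t start = *last_done_ns > now_ns ? *last_done_ns : now_ns;
    *last_done_ns = start + toy_tx_time_ns(len_bytes, rate_bps);
    return *last_done_ns;
}
```

At 512000 bit/s a 1500-byte packet occupies the link for 23.4375 ms, so two packets queued at the same instant finish at 23.4375 ms and 46.875 ms. If both were delayed independently from "now", they would overlap and the measured rate would exceed the configured one.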
      Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Mar, 2017 21 commits