1. 08 Dec, 2017 11 commits
    • net: sched: use skb list for skb_bad_tx · 70e57d5e
      John Fastabend authored
      Similar to how gso is handled, use an skb list for skb_bad_tx. This is
      required with lockless qdiscs because we may have multiple cores
      attempting to push skbs into skb_bad_tx concurrently.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
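
      As a rough illustration of the direction (a sketch, not the patch
      itself; the struct and function names below are hypothetical), parking
      refused packets on a struct sk_buff_head protected by its own lock lets
      several CPUs requeue safely where a single skb pointer could not:

        #include <linux/skbuff.h>
        #include <linux/spinlock.h>

        /* Hypothetical per-qdisc holder for packets a driver refused; the
         * sk_buff_head is initialized elsewhere with skb_queue_head_init().
         */
        struct bad_tx_queue {
                struct sk_buff_head skb_bad_txq;    /* list with its own lock */
        };

        /* Park an skb; safe against several CPUs requeueing concurrently. */
        static void bad_txq_enqueue(struct bad_tx_queue *btq, struct sk_buff *skb)
        {
                spin_lock_bh(&btq->skb_bad_txq.lock);
                __skb_queue_tail(&btq->skb_bad_txq, skb);
                spin_unlock_bh(&btq->skb_bad_txq.lock);
        }

        /* Pull the oldest parked skb back out for retransmission, if any. */
        static struct sk_buff *bad_txq_dequeue(struct bad_tx_queue *btq)
        {
                struct sk_buff *skb;

                spin_lock_bh(&btq->skb_bad_txq.lock);
                skb = __skb_dequeue(&btq->skb_bad_txq);
                spin_unlock_bh(&btq->skb_bad_txq.lock);
                return skb;
        }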
    • net: sched: drop qdisc_reset from dev_graft_qdisc · 7bbde83b
      John Fastabend authored
      In dev_graft_qdisc() a "new" qdisc is attached and the qdisc_destroy()
      operation is called on the old qdisc. The destroy operation will wait
      an RCU grace period and then call qdisc_rcu_free(), at which point
      gso_cpu_skb is freed along with all stats, so there is no need to zero
      stats and gso_cpu_skb from the graft operation itself.
      
      Further, after dropping the qdisc locks we cannot call qdisc_reset()
      before an RCU grace period has passed and the qdisc is detached from
      all CPUs. By removing the qdisc_reset() here we get the correct
      behaviour: wait an RCU grace period and let the qdisc_destroy()
      operation clean up the qdisc.
      
      Note: a refcnt greater than 1 would cause the destroy operation to be
      aborted; however, if this ever happened the reference to the qdisc
      would be lost and we would have a memory leak.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
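
      A minimal sketch of the grafting pattern this leaves behind
      (illustrative only; the real dev_graft_qdisc() also deals with
      qdisc_sleeping and noop qdiscs): publish the new qdisc and hand the old
      one to qdisc_destroy(), which defers the teardown by an RCU grace
      period.

        #include <linux/rtnetlink.h>
        #include <net/sch_generic.h>

        /* Illustrative grafting step; locking and error handling elided. */
        static struct Qdisc *graft_qdisc_sketch(struct netdev_queue *dev_queue,
                                                struct Qdisc *new)
        {
                struct Qdisc *old;

                old = rtnl_dereference(dev_queue->qdisc);
                rcu_assign_pointer(dev_queue->qdisc, new);

                /* No qdisc_reset(old) here: other CPUs may still be using the
                 * old qdisc until an RCU grace period has elapsed.  The caller
                 * passes 'old' to qdisc_destroy(), which waits for that grace
                 * period before freeing gso_cpu_skb and the statistics.
                 */
                return old;
        }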
    • net: sched: explicit locking in gso_cpu fallback · a53851e2
      John Fastabend authored
      This work prepares the qdisc layer to support egress lockless qdiscs.
      When running the egress qdisc lockless and we overrun the netdev, for
      whatever reason, the netdev returns a busy error code and the skb is
      parked on the gso_skb pointer. With many cores all hitting this case
      at once it is possible to have multiple sk_buffs here, so we turn
      gso_skb into a queue.

      This should be the edge case; if we see it frequently then the
      netdev/qdisc layer needs to back off.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
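
      Conceptually (a sketch using assumed names, not the committed code),
      the dequeue side then drains the requeue list before asking the qdisc
      for fresh packets:

        #include <net/sch_generic.h>

        /* Illustrative helper: serve previously parked skbs first, then fall
         * back to the qdisc's own dequeue.  'gso_skb_q' is a hypothetical
         * sk_buff_head standing in for the gso_skb queue described above.
         */
        static struct sk_buff *dequeue_skb_sketch(struct Qdisc *q,
                                                  struct sk_buff_head *gso_skb_q)
        {
                struct sk_buff *skb;

                spin_lock_bh(&gso_skb_q->lock);
                skb = __skb_dequeue(gso_skb_q);
                spin_unlock_bh(&gso_skb_q->lock);
                if (skb)
                        return skb;           /* retry the parked packet first */

                return q->dequeue(q);         /* otherwise pull a fresh packet */
        }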
    • net: sched: a dflt qdisc may be used with per cpu stats · d59f5ffa
      John Fastabend authored
      Enable dflt qdisc support for per-cpu stats. Before this patch a dflt
      qdisc was required to use the global statistics qstats and bstats.

      This adds a static flags field to qdisc_ops that is propagated into
      qdisc->flags in the qdisc allocation call. This allows the allocation
      block to completely allocate the qdisc object, so we don't have
      dangling allocations after qdisc init.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
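
      A hedged sketch of what that looks like (the field name static_flags
      and the TCQ_F_CPUSTATS flag follow the description above; treat the
      exact names as assumptions): a qdisc declares per-cpu stats up front,
      and the allocator copies those flags and allocates the per-cpu counters
      before init runs.

        /* A qdisc opting into per-cpu stats at definition time (sketch). */
        static struct Qdisc_ops example_qdisc_ops __read_mostly = {
                .id           = "example",
                .static_flags = TCQ_F_CPUSTATS,
                /* .enqueue / .dequeue / .init etc. elided */
        };

        /* In the allocation path (sketch): propagate the ops' static flags
         * and allocate the per-cpu counters before the qdisc's init runs.
         */
        sch->flags |= ops->static_flags;
        if (sch->flags & TCQ_F_CPUSTATS) {
                sch->cpu_bstats =
                        netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
                sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
        }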
    • net: sched: provide per cpu qstat helpers · 40bd0362
      John Fastabend authored
      Per-cpu qstats support was added along with per-cpu bstats support,
      which is currently used by the ingress qdisc. This patch adds the set
      of helpers needed to let other qdiscs use per-cpu qstats as well.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
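
      For flavour, a helper in this spirit is just a this_cpu increment on
      the qdisc's per-cpu queue statistics (a sketch; the upstream helper
      names may differ):

        #include <net/sch_generic.h>

        /* Bump this CPU's drop counter without touching the shared counter
         * and without taking any lock.
         */
        static inline void qdisc_qstats_cpu_drop_sketch(struct Qdisc *sch)
        {
                this_cpu_inc(sch->cpu_qstats->drops);
        }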
    • net: sched: remove remaining uses for qdisc_qlen in xmit path · 29b86cda
      John Fastabend authored
      sch_direct_xmit() uses qdisc_qlen as a return value, but all call sites
      of the routine only check whether it is zero or not. Simplify the logic
      so that we don't need to return an actual queue length value.

      This introduces a case where sch_direct_xmit() would previously have
      returned a qlen of zero but now returns true. However, in this case all
      call sites of sch_direct_xmit() will perform a dequeue(), get a NULL
      skb, and abort. This trades tracking qlen in the hot path for an extra
      dequeue operation. Overall this seems to be good for performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
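
      The caller-side pattern then looks roughly like this (a sketch of the
      transmit loop; the surrounding requeue and lock handling is elided):
      keep going while sch_direct_xmit() reports progress, and let a NULL
      dequeue terminate the loop instead of a qlen check.

        for (;;) {
                /* false means the driver was busy or the qdisc throttled;
                 * no queue length is reported back any more.
                 */
                if (!sch_direct_xmit(skb, q, dev, txq, root_lock, validate))
                        break;

                /* The old "qlen == 0" case now shows up as a NULL dequeue. */
                skb = q->dequeue(q);
                if (!skb)
                        break;
        }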
    • net: sched: allow qdiscs to handle locking · 6b3ba914
      John Fastabend authored
      This patch adds a flag for queueing disciplines to indicate that the
      stack does not need to use the qdisc lock to protect operations. This
      can be used to build lockless scheduling algorithms and improve
      performance.
      
      The flag is checked in the tx path and the qdisc lock is only taken
      if it is not set. For now use a conditional if statement. Later we
      could be more aggressive if it proves worthwhile and use a static key
      or wrap this in a likely().
      
      Also, the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
      for this is that synchronizing a qlen counter across threads proves to
      cost more than doing the enqueue/dequeue operations when tested with
      pktgen.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
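
      In the transmit path the conditional looks roughly like this (a sketch
      of the described pattern, not the full __dev_xmit_skb(); TCQ_F_NOLOCK
      is assumed as the flag's name, and busylock/to_free handling is
      elided):

        if (q->flags & TCQ_F_NOLOCK) {
                /* Lockless qdisc: enqueue and run without the root lock. */
                rc = q->enqueue(skb, q, &to_free);
                qdisc_run(q);
        } else {
                /* Classic path: serialize everything on the qdisc lock. */
                spin_lock(qdisc_lock(q));
                rc = q->enqueue(skb, q, &to_free);
                qdisc_run(q);
                spin_unlock(qdisc_lock(q));
        }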
    • net: sched: cleanup qdisc_run and __qdisc_run semantics · 6c148184
      John Fastabend authored
      Currently __qdisc_run calls qdisc_run_end() but does not call
      qdisc_run_begin(). This makes it hard to track pairs of
      qdisc_run_{begin,end} across function calls.
      
      To simplify reading these code paths, this patch moves the begin/end
      calls into qdisc_run().
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
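
      The resulting pairing is simple enough to sketch (illustrative, but
      close in spirit to the helper described):

        static inline void qdisc_run(struct Qdisc *q)
        {
                /* qdisc_run() owns both halves of the begin/end pair, so
                 * __qdisc_run() no longer hides one side of it.
                 */
                if (qdisc_run_begin(q)) {
                        __qdisc_run(q);
                        qdisc_run_end(q);
                }
        }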
    • virtio_net: Disable interrupts if napi_complete_done rescheduled napi · fdaa767a
      Toshiaki Makita authored
      Since commit 39e6c820 ("net: solve a NAPI race") napi can be
      rescheduled within napi_complete_done() even in the non-busypoll case,
      but virtnet_poll() always enabled interrupts before completing, and
      when napi was rescheduled within napi_complete_done() it did not
      disable interrupts. This caused more interrupts when the event idx
      feature is disabled.
      
      According to commit cbdadbbf ("virtio_net: fix race in RX VQ
      processing") we cannot place virtqueue_enable_cb_prepare() after
      NAPI_STATE_SCHED is cleared, so disable interrupts again if
      napi_complete_done() returned false.
      
      Tested with vhost-user of OVS 2.7 on host, which does not have the event
      idx feature.
      
      * Before patch:
      
      $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
      MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992    1472   60.00     32763206      0    6430.32
      212992           60.00     23384299           4589.56
      
      Interrupts on guest: 9872369
      Packets/interrupt:   2.37
      
      * After patch
      
      $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
      MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992    1472   60.00     32794646      0    6436.49
      212992           60.00     32793501           6436.27
      
      Interrupts on guest: 4941299
      Packets/interrupt:   6.64
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
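
      The shape of the fix in the poll routine is roughly the following (a
      sketch; virtqueue_napi_schedule() is assumed to be the driver's local
      helper and the surrounding receive loop is abbreviated):

        if (received < budget) {
                /* Re-arm the callback before completing, as required by
                 * commit cbdadbbf ("virtio_net: fix race in RX VQ processing").
                 */
                opaque = virtqueue_enable_cb_prepare(rq->vq);
                if (napi_complete_done(napi, received)) {
                        /* More work arrived while completing: poll again. */
                        if (unlikely(virtqueue_poll(rq->vq, opaque)))
                                virtqueue_napi_schedule(napi, rq->vq);
                } else {
                        /* napi was rescheduled inside napi_complete_done():
                         * disable the callback again so we do not take a
                         * spurious interrupt for work we will poll anyway.
                         */
                        virtqueue_disable_cb(rq->vq);
                }
        }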
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 62cd2770
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2017-12-07
      
      The following pull-request contains BPF updates for your net-next tree.
      
      The main changes are:
      
      1) Detailed documentation of BPF development process from Daniel.
      
      2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops
         from Lawrence.
      
      3) Minor follow up for bpf_skb_set_tunnel_key() from William.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tuntap: fix possible deadlock when fail to register netdev · 124da8f6
      Jason Wang authored
      The private destructor could be called when register_netdev() fails,
      with the rtnl lock held. This leads to a deadlock in tun_free_netdev(),
      which tries to take rtnl_lock. Fix this by switching to a spinlock for
      synchronization.
      
      Fixes: 96f84061 ("tun: add eBPF based queue selection method")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
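
      A generic sketch of the locking change (all names below are
      hypothetical stand-ins, not the driver's real structures): the
      RCU-managed program pointer is swapped under a private spinlock instead
      of relying on rtnl_lock, so teardown can run while rtnl is already
      held.

        #include <linux/bpf.h>
        #include <linux/spinlock.h>
        #include <linux/rcupdate.h>

        /* Hypothetical stand-in for the driver's private state. */
        struct tun_sketch {
                spinlock_t lock;
                struct bpf_prog __rcu *prog;
        };

        static void swap_prog_locked(struct tun_sketch *tun, struct bpf_prog *new)
        {
                struct bpf_prog *old;

                spin_lock_bh(&tun->lock);
                old = rcu_dereference_protected(tun->prog,
                                                lockdep_is_held(&tun->lock));
                rcu_assign_pointer(tun->prog, new);
                spin_unlock_bh(&tun->lock);

                if (old) {
                        synchronize_rcu();      /* wait out readers of 'old' */
                        bpf_prog_put(old);
                }
        }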
  2. 07 Dec, 2017 9 commits
  3. 06 Dec, 2017 17 commits
  4. 05 Dec, 2017 3 commits
    • net_sched: remove unused parameter from act cleanup ops · 9a63b255
      Cong Wang authored
      No one actually uses it.
      
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-use-per-port-upstream-port' · 8bf54381
      David S. Miller authored
      Vivien Didelot says:
      
      ====================
      net: dsa: use per-port upstream port
      
      An upstream port is a local switch port used to reach a CPU port.
      
      DSA still assumes a unique CPU port in the whole switch fabric and
      thus returns a unique upstream port for a given switch. This is wrong
      in an environment with multiple CPU ports.
      
      We are now switching to using the dedicated CPU port assigned to each
      port in order to get rid of the deprecated unique tree CPU port.
      
      This patchset makes the dsa_upstream_port() helper take a port argument
      and gets one step closer to complete support for multiple CPU ports.
      
      Changes in v2:
        - reverse-christmas-tree-fy variables
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: return per-port upstream port · 07073c79
      Vivien Didelot authored
      The current dsa_upstream_port() helper still assumes a unique CPU port
      in the whole switch fabric. This is becoming wrong, as every port in
      the fabric has its own dedicated CPU port, and thus every port has its
      own upstream port.

      Add a port argument to the dsa_upstream_port() helper and fetch that
      port's CPU port instead of the deprecated unique fabric CPU port. A CPU
      or unused port has no dedicated CPU port, so return the port itself in
      this case.
      
      At the same time, change the return value from u8 to unsigned int since
      there is no need to limit the size here.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
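
      A sketch consistent with that description (assuming the dsa_to_port()
      and cpu_dp accessors; not necessarily the exact helper as merged):

        #include <net/dsa.h>

        /* Per-port upstream lookup: a port reports the index of its own
         * dedicated CPU port; CPU and unused ports report themselves.
         */
        static inline unsigned int upstream_port_sketch(struct dsa_switch *ds,
                                                        int port)
        {
                struct dsa_port *dp = dsa_to_port(ds, port);

                if (!dp->cpu_dp)
                        return port;              /* CPU or unused port */

                return dp->cpu_dp->index;         /* its dedicated CPU port */
        }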