1. 25 Apr, 2017 7 commits
    • Merge branch 'virtio-net-tx-napi' · 86a5df14
      David S. Miller authored
      Willem de Bruijn says:
      
      ====================
      virtio-net tx napi
      
      Add napi for virtio-net transmit completion processing.
      
      Changes:
        v2 -> v3:
          - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
                ensure that the handler always cleans, to avoid deadlock
          - unconditionally clean in start_xmit
                avoid adding an unnecessary "if (use_napi)" branch
          - remove virtqueue_disable_cb in patch 5/5
                a noop in the common event_idx based loop
          - document affinity_hint_set constraint
      
        v1 -> v2:
          - disable by default
          - disable unless affinity_hint_set
                because cache misses add up to a third higher cycle cost,
      	  e.g., in TCP_RR tests. This is not limited to the patch
      	  that enables tx completion cleaning in rx napi.
          - use trylock to avoid contention between tx and rx napi
          - keep interrupts masked during xmit_more (new patch 5/5)
                this improves cycles especially for multi UDP_STREAM, which
      	  does not benefit from cleaning tx completions on rx napi.
          - move free_old_xmit_skbs (new patch 3/5)
                to avoid forward declaration
      
          not changed:
          - deduplicate virtnet_poll_tx and virtnet_poll_txclean
                they look similar, but differ too much to make it
                worthwhile.
          - delay netif_wake_subqueue for more than 2 + MAX_SKB_FRAGS
                evaluated, but made no difference
          - patch 1/5
      
        RFC -> v1:
          - dropped vhost interrupt moderation patch:
                not needed and likely expensive at light load
          - remove tx napi weight
              - always clean all tx completions
              - use boolean to toggle tx-napi, instead
          - only clean tx in rx if tx-napi is enabled
              - then clean tx before rx
          - fix: add missing braces in virtnet_freeze_down
          - testing: add 4KB TCP_RR + UDP test results
      
      Based on previous patchsets by Jason Wang:
      
        [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net
        http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html
      
      Before commit b0c39dbd ("virtio_net: don't free buffers in xmit
      ring") the virtio-net driver would free transmitted packets on
      transmission of new packets in ndo_start_xmit and, to catch the edge
      case when no new packet is sent, also in a timer at 10HZ.
      
      A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls
      due to low free descriptor count. It does not address stalls due to
      low socket SO_SNDBUF. Increasing timer frequency decreases that stall
      time, but increases interrupt rate and, thus, cycle count.
      
      Currently, with no timer, packets are freed only at ndo_start_xmit.
      Latency of consume_skb is now unbounded. To avoid a deadlock if a sock
      reaches SO_SNDBUF, packets are orphaned on tx. This breaks TCP small
      queues.
      
      Reenable TCP small queues by removing the orphan. Instead of using a
      timer, convert the driver to regular tx napi. This does not have the
      unresolved stall issue and does not have any frequency to tune.
      
      By keeping interrupts enabled by default, napi increases tx
      interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if
      one is already unacknowledged, so makes this more feasible today.
      Combine that with an optimization that brings interrupt rate
      back in line with the existing version for most workloads:
      
      Tx completion cleaning on rx interrupts elides most explicit tx
      interrupts by relying on the fact that many rx interrupts fire.
      
      Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks
      from a guest to a server on the host, on an x86_64 Haswell. The guest
      runs 4 vCPUs pinned to 4 cores. vhost and the test server are
      pinned to a core each.
      
      All results are the median of 5 runs, with variance well < 10%.
      Used neper (github.com/google/neper) as test process.
      
      Napi increases single stream throughput, but increases cycle cost.
      The optimizations bring this down. The previous patchset saw a
      regression with UDP_STREAM, which does not benefit from cleaning tx
      completions in rx napi. This regression is now gone for 10x, 100x.
      Remaining difference is higher 1x TCP_STREAM, lower 1x UDP_STREAM.
      
      The latest results are with process, rx napi and tx napi affine to
      the same core. All numbers are lower than the previous patchset.
      
                   upstream     napi
      TCP_STREAM:
      1x:
        Mbps          27816    39805
        Gcycles         274      285
      
      10x:
        Mbps          42947    42531
        Gcycles         300      296
      
      100x:
        Mbps          31830    28042
        Gcycles         279      269
      
      TCP_RR Latency (us):
      1x:
        p50              21       21
        p99              27       27
        Gcycles         180      167
      
      10x:
        p50              40       39
        p99              52       52
        Gcycles         214      211
      
      100x:
        p50             281      241
        p99             411      337
        Gcycles         218      226
      
      TCP_RR 4K:
      1x:
        p50              28       29
        p99              34       36
        Gcycles         177      167
      
      10x:
        p50              70       71
        p99              85      134
        Gcycles         213      214
      
      100x:
        p50             442      611
        p99             802      785
        Gcycles         237      216
      
      UDP_STREAM:
      1x:
        Mbps          29468    26800
        Gcycles         284      293
      
      10x:
        Mbps          29891    29978
        Gcycles         285      312
      
      100x:
        Mbps          30269    30304
        Gcycles         318      316
      
      UDP_RR:
      1x:
        p50              19       19
        p99              23       23
        Gcycles         180      173
      
      10x:
        p50              35       40
        p99              54       64
        Gcycles         245      237
      
      100x:
        p50             234      286
        p99             484      473
        Gcycles         224      214
      
      Note that GSO is enabled, so 4K RR still translates to one packet
      per request.
      
      Lower throughput at 100x vs 10x can be (at least in part)
      explained by looking at bytes per packet sent (nstat). It likely
      also explains the lower throughput of 1x for some variants.
      
      upstream:
      
       N=1   bytes/pkt=16581
       N=10  bytes/pkt=61513
       N=100 bytes/pkt=51558
      
      at_rx:
      
       N=1   bytes/pkt=65204
       N=10  bytes/pkt=65148
       N=100 bytes/pkt=56840
      ====================
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
    • virtio-net: keep tx interrupts disabled unless kick · bdb12e0d
      Willem de Bruijn authored
      Tx napi mode increases the rate of transmit interrupts. Suppress some
      by masking interrupts while more packets are expected. The interrupts
      will be reenabled before the last packet is sent.
      
      This optimization reduces the throughput drop with tx napi for
      unidirectional flows such as UDP_STREAM that do not benefit from
      cleaning tx completions in the receive napi handler.
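      The masking rule described above can be sketched as a small user-space
      model. This is only an illustration under stated assumptions, not the
      driver code: struct txq_model, model_xmit and model_burst are
      hypothetical names, and the "more" flag stands in for the stack's
      hint that another packet follows immediately.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a tx queue. It only tracks whether the tx
 * completion interrupt is unmasked and how often it was re-armed. */
struct txq_model {
    bool irq_enabled;          /* completion interrupt currently unmasked */
    unsigned int enable_calls; /* times the interrupt was re-enabled */
    unsigned int kicks;        /* times the device was notified */
};

/* Model of one transmit call. more == true means another packet is
 * expected right behind this one, so no kick is issued yet. */
static void model_xmit(struct txq_model *q, bool use_napi, bool more)
{
    bool kick = !more;

    /* Keep completions masked while the burst is still being queued. */
    q->irq_enabled = false;

    /* (enqueueing the packet on the ring is omitted in this model) */

    /* Re-arm only in tx napi mode, and only when the burst ends. */
    if (use_napi && kick) {
        q->irq_enabled = true;
        q->enable_calls++;
    }
    if (kick)
        q->kicks++;
}

/* Send a burst of n packets; all but the last carry the "more" hint.
 * Returns how many times the interrupt was re-enabled so far. */
static unsigned int model_burst(struct txq_model *q, bool use_napi, size_t n)
{
    for (size_t i = 0; i < n; i++)
        model_xmit(q, use_napi, i + 1 < n);
    return q->enable_calls;
}
```

      In this model a burst of eight packets in tx napi mode re-arms the
      interrupt once, together with the single kick, instead of once per
      packet.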
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: clean tx descriptors from rx napi · 7b0411ef
      Willem de Bruijn authored
      Amortize the cost of virtual interrupts by doing both rx and tx work
      on reception of a receive interrupt if tx napi is enabled. With
      VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
      interrupts for bidirectional workloads.
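      For context, the event index test that decides whether a notification
      is needed can be sketched as below. The arithmetic follows the
      vring_need_event() helper from the virtio specification; the function
      name need_event and the standalone form are assumptions made for
      illustration. The idea is that a notification is sent only if the
      index the other side asked to be told about (event_idx) was crossed
      while moving from old_idx to new_idx.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if moving the ring index from old_idx to new_idx crossed
 * event_idx, i.e. the other side asked to be notified in this window.
 * All arithmetic is modulo 2^16, so index wraparound is handled. */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

      If the requested index lies outside the window just processed, the
      check fails and no interrupt is raised. This is one way to read why
      tx completions that were already cleaned from the receive napi
      handler need few explicit tx interrupts.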
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: move free_old_xmit_skbs · ea7735d9
      Willem de Bruijn authored
      An upcoming patch will call free_old_xmit_skbs indirectly from
      virtnet_poll. Move the function above this to avoid having to
      introduce a forward declaration.
      
      This is a pure move: no code changes.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: transmit napi · b92f1e67
      Willem de Bruijn authored
      Convert virtio-net to a standard napi tx completion path. This enables
      better TCP pacing using TCP small queues and increases single stream
      throughput.
      
      The virtio-net driver currently cleans tx descriptors on transmission
      of new packets in ndo_start_xmit. Latency depends on new traffic, so
      is unbounded. To avoid deadlock when a socket reaches its snd limit,
      packets are orphaned on transmission. This breaks socket backpressure,
      including TSQ.
      
      Napi increases the number of interrupts generated compared to the
      current model, which keeps interrupts disabled as long as the ring
      has enough free descriptors. Keep tx napi optional and disabled for
      now. Follow-on patches will reduce the interrupt cost.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: napi helper functions · e4e8452a
      Willem de Bruijn authored
      Prepare virtio-net for tx napi by converting existing napi code to
      use helper functions. This also deduplicates some logic.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sparc64: Improve 64-bit constant loading in eBPF JIT. · 14933dc8
      David S. Miller authored
      Doing a full 64-bit decomposition is really stupid especially for
      simple values like 0 and -1.
      
      But if we are going to optimize this, go all the way and try for all 2
      and 3 instruction sequences not requiring a temporary register as
      well.
      
      First we do the easy cases where it's a zero or sign extended 32-bit
      number (sethi+or, sethi+xor, respectively).
      
      Then we try to find a range of set bits we can load simply then shift
      up into place, in various ways.
      
      Then we try negating the constant and see if we can do a simple
      sequence using that with a xor at the end.  (f.e. the range of set
      bits can't be loaded simply, but for the negated value it can)
      
      The final optimized strategy involves 4-instruction sequences not
      needing a temporary register.
      
      Otherwise we sadly fully decompose using a temp.
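
      The first steps of that cascade can be sketched as a classifier over
      the 64-bit constant. This is a simplified illustration, not the JIT
      code: enum load_kind and classify_const are hypothetical names, and
      the "range of set bits" case is narrowed here to a single contiguous
      run of ones, which is an assumption. The negated-constant and
      4-instruction cases are lumped into LOAD_OTHER.

```c
#include <stdint.h>

/* Hypothetical labels for the loading strategies described above. */
enum load_kind {
    LOAD_ZEXT32,      /* zero-extended 32-bit value */
    LOAD_SEXT32,      /* sign-extended (negative) 32-bit value */
    LOAD_SHIFTED_RUN, /* one contiguous run of set bits, shifted into place */
    LOAD_OTHER        /* needs one of the longer sequences */
};

/* True if the upper 32 bits are all zero. */
static int fits_zext32(uint64_t k)
{
    return (k >> 32) == 0;
}

/* True if bits 63..31 are all ones, i.e. a negative 32-bit value
 * sign-extended to 64 bits. */
static int fits_neg_sext32(uint64_t k)
{
    return (k >> 31) == 0x1ffffffffULL;
}

/* True if the set bits of k form exactly one contiguous run. Adding the
 * lowest set bit carries through the whole run, so the sum shares no
 * bits with k exactly when there is a single run. */
static int is_single_run(uint64_t k)
{
    uint64_t low;

    if (k == 0)
        return 0;
    low = k & (~k + 1); /* isolate the lowest set bit */
    return ((k + low) & k) == 0;
}

/* Try the cheap cases first, in the order the text lists them. */
static enum load_kind classify_const(uint64_t k)
{
    if (fits_zext32(k))
        return LOAD_ZEXT32;
    if (fits_neg_sext32(k))
        return LOAD_SEXT32;
    if (is_single_run(k))
        return LOAD_SHIFTED_RUN;
    return LOAD_OTHER;
}
```

      Under these assumptions the constant 0x0000ffffffff0000 from the
      example below is classified as a shifted run, which matches the
      sethi, or, sllx sequence in the disassembly.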
      
      Example, from ALU64_XOR_K: 0x0000ffffffff0000 ^ 0x0 = 0x0000ffffffff0000:
      
      0000000000000000 <foo>:
         0:   9d e3 bf 50     save  %sp, -176, %sp
         4:   01 00 00 00     nop
         8:   90 10 00 18     mov  %i0, %o0
         c:   13 3f ff ff     sethi  %hi(0xfffffc00), %o1
        10:   92 12 63 ff     or  %o1, 0x3ff, %o1     ! ffffffff <foo+0xffffffff>
        14:   93 2a 70 10     sllx  %o1, 0x10, %o1
        18:   15 3f ff ff     sethi  %hi(0xfffffc00), %o2
        1c:   94 12 a3 ff     or  %o2, 0x3ff, %o2     ! ffffffff <foo+0xffffffff>
        20:   95 2a b0 10     sllx  %o2, 0x10, %o2
        24:   92 1a 60 00     xor  %o1, 0, %o1
        28:   12 e2 40 8a     cxbe  %o1, %o2, 38 <foo+0x38>
        2c:   9a 10 20 02     mov  2, %o5
        30:   10 60 00 03     b,pn   %xcc, 3c <foo+0x3c>
        34:   01 00 00 00     nop
        38:   9a 10 20 01     mov  1, %o5     ! 1 <foo+0x1>
        3c:   81 c7 e0 08     ret
        40:   91 eb 40 00     restore  %o5, %g0, %o0
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 24 Apr, 2017 33 commits