    net: Do not enable tx-nocache-copy by default · cdb3f4a3
    Benjamin Poirier authored
    There are many cases where this feature does not improve performance or even
    reduces it.
    
    For example, here are the results from tests that I ran with kernel 3.12.6 on
    one Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results
    are from the Xeon, but they are similar on the i7. All numbers report the
    mean±stddev over 10 runs of 10 s each.
    
    1) latency tests similar to what is described in "c6e1a0d1 net: Allow no-cache
    copy from user on transmit"
    There is no statistically significant difference between tx-nocache-copy
    on/off.
    nic irqs spread out (one queue per cpu)
    
    200x netperf -r 1400,1
    tx-nocache-copy off
            692000±1000 tps
            50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
    tx-nocache-copy on
            693000±1000 tps
            50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7
    
    200x netperf -r 14000,14000
    tx-nocache-copy off
            86450±80 tps
            50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
    tx-nocache-copy on
            86110±60 tps
            50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20
    
    2) single stream throughput tests
    tx-nocache-copy leads to higher service demand
    
                            throughput  cpu0        cpu1        demand
                            (Mb/s)      (Gcycle)    (Gcycle)    (cycle/B)
    
    nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)
    
    tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
    tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004
    
    nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)
    
    tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
    tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002
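
The demand column above appears consistent with the total cycles spent on the CPUs involved divided by the bytes sent during one 10 s run, reading the throughput figures as Mb/s. That definition is an interpretation of the numbers, not something stated in the original message, and the helper name `service_demand` is hypothetical. A minimal Python sketch of the arithmetic under those assumptions:

```python
def service_demand(throughput_mbps, gcycles, duration_s=10.0):
    """Return cycles per byte sent.

    Assumptions (not stated in the commit message): throughput is in Mb/s,
    cycle counts are in Gcycle per run, and demand sums all listed CPUs.
    """
    bytes_sent = throughput_mbps * 1e6 / 8 * duration_s
    return sum(gcycles) * 1e9 / bytes_sent


# Mean values copied from the table above (stddev omitted).
results = {
    "irqs+netperf on cpu0, off": service_demand(9402, [9.4]),
    "irqs+netperf on cpu0, on": service_demand(9403, [9.85]),
    "irqs cpu0, netperf cpu1, off": service_demand(9401, [5.83, 5.0]),
    "irqs cpu0, netperf cpu1, on": service_demand(9404, [5.74, 5.523]),
}

for name, demand in results.items():
    print(f"{name}: {demand:.3f} cycle/B")
```

The computed values land within rounding of the reported 0.80, 0.838, 0.923 and 0.958 cycle/B, which supports this reading of the columns.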
    
    As a second example, here are some results from Eric Dumazet with the latest
    net-next.
    Here, too, tx-nocache-copy leads to higher service demand.
    
    (cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)
    
    lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
    lpq83:~# perf stat ./netperf -H lpq84 -c
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
    Recv   Send    Send                          Utilization       Service Demand
    Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
    Size   Size    Size     Time     Throughput  local    remote   local   remote
    bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
    
     87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000
    
     Performance counter stats for './netperf -H lpq84 -c':
    
           4282.648396 task-clock                #    0.423 CPUs utilized
                 9,348 context-switches          #    0.002 M/sec
                    88 CPU-migrations            #    0.021 K/sec
                   355 page-faults               #    0.083 K/sec
        11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
         9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
         4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
         6,053,172,792 instructions              #    0.51  insns per cycle
                                                 #    1.49  stalled cycles per insn [83.64%]
           597,275,583 branches                  #  139.464 M/sec                   [83.70%]
             8,960,541 branch-misses             #    1.50% of all branches         [83.65%]
    
          10.128990264 seconds time elapsed
    
    lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
    lpq83:~# perf stat ./netperf -H lpq84 -c
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
    Recv   Send    Send                          Utilization       Service Demand
    Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
    Size   Size    Size     Time     Throughput  local    remote   local   remote
    bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
    
     87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000
    
     Performance counter stats for './netperf -H lpq84 -c':
    
           2847.375441 task-clock                #    0.281 CPUs utilized
                11,632 context-switches          #    0.004 M/sec
                    49 CPU-migrations            #    0.017 K/sec
                   354 page-faults               #    0.124 K/sec
         7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
         6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
         1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
         2,079,702,453 instructions              #    0.27  insns per cycle
                                                 #    2.94  stalled cycles per insn [83.22%]
           363,773,213 branches                  #  127.757 M/sec                   [83.29%]
             4,242,732 branch-misses             #    1.17% of all branches         [83.51%]
    
          10.128449949 seconds time elapsed
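
To put the two perf stat runs above side by side, the following Python sketch computes the relative increase of a few counters with tx-nocache-copy on versus off. The numbers are copied from the output above; the helper name `relative_increase` and the dictionary keys are hypothetical, and each setting was measured only once here, so the ratios are indicative rather than statistically robust.

```python
# Values copied from the two perf stat / netperf outputs above.
nocache_on = {
    "cycles": 11_812_797_651,
    "task_clock_ms": 4282.648396,
    "instructions": 6_053_172_792,
    "send_demand_us_per_kb": 0.522,
}
nocache_off = {
    "cycles": 7_646_889_749,
    "task_clock_ms": 2847.375441,
    "instructions": 2_079_702_453,
    "send_demand_us_per_kb": 0.449,
}


def relative_increase(on_value, off_value):
    """Percentage by which on_value exceeds off_value."""
    return (on_value - off_value) / off_value * 100.0


increase = {
    key: relative_increase(nocache_on[key], nocache_off[key])
    for key in nocache_on
}

for key, pct in increase.items():
    print(f"{key}: {pct:+.1f}% with tx-nocache-copy on")
```

In this single pair of runs, throughput is essentially unchanged while the sender spends roughly 54% more cycles and shows about 16% higher send service demand with the feature enabled.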
    
    CC: Tom Herbert <therbert@google.com>
    Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
dev.c 172 KB