1. 29 Sep, 2014 1 commit
    • Eric Dumazet's avatar
      dql: dql_queued() should write first to reduce bus transactions · 3d9a0d2f
      Eric Dumazet authored
      While doing high throughput test on a BQL enabled NIC,
      I found a very high cost in ndo_start_xmit() when accessing BQL data.
      
      It turned out the problem was caused by compiler trying to be
      smart, but involving a bad MESI transaction :
      
        0.05 │  mov    0xc0(%rax),%edi    // LOAD dql->num_queued
        0.48 │  mov    %edx,0xc8(%rax)    // STORE dql->last_obj_cnt = count
       58.23 │  add    %edx,%edi
        0.58 │  cmp    %edi,0xc4(%rax)
        0.76 │  mov    %edi,0xc0(%rax)    // STORE dql->num_queued += count
        0.72 │  js     bd8
      
      I got an incredible 10 % gain [1] by making sure cpu do not attempt
      to get the cache line in Shared mode, but directly requests for
      ownership.
      
      New code :
      	mov    %edx,0xc8(%rax)  // STORE dql->last_obj_cnt = count
      	add    %edx,0xc0(%rax)  // RMW   dql->num_queued += count
      	mov    0xc4(%rax),%ecx  // LOAD dql->adj_limit
      	mov    0xc0(%rax),%edx  // LOAD dql->num_queued
      	cmp    %edx,%ecx
      
      The TX completion was running from another cpu, with high interrupts
      rate.
      
      Note that I am using barrier() as a soft hint, as mb() here could be
      too heavy cost.
      
      [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3d9a0d2f
  2. 28 Sep, 2014 32 commits
  3. 26 Sep, 2014 7 commits