1. 16 Jun, 2017 19 commits
    • networking: make skb_pull & friends return void pointers · af72868b
      Johannes Berg authored
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = {
                  skb_pull,
                  __skb_pull,
                  skb_pull_inline,
                  __pskb_pull_tail,
                  __pskb_pull,
                  pskb_pull
          };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = {
                  skb_pull,
                  __skb_pull,
                  skb_pull_inline,
                  __pskb_pull_tail,
                  __pskb_pull,
                  pskb_pull
          };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
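      As a minimal before/after sketch of the effect at a call site (the
      helper names below are made up, not taken from the tree):

          #include <linux/skbuff.h>
          #include <linux/if_ether.h>
          #include <linux/ip.h>

          /* Before: skb_pull() returned unsigned char *, so getting a typed
           * pointer to the new data start required a cast. */
          static struct iphdr *pull_to_iphdr_old(struct sk_buff *skb)
          {
                  return (struct iphdr *)skb_pull(skb, ETH_HLEN);
          }

          /* After: skb_pull() returns void *, which converts implicitly. */
          static struct iphdr *pull_to_iphdr_new(struct sk_buff *skb)
          {
                  return skb_pull(skb, ETH_HLEN);
          }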
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • networking: make skb_put & friends return void pointers · 4df864c1
      Johannes Berg authored
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions (skb_put, __skb_put and pskb_put) return void *
      and remove all the casts across the tree, adding a (u8 *) cast only
      where the unsigned char pointer was used directly, all done with the
      following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_put, __skb_put };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_put, __skb_put };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
      which actually doesn't cover pskb_put since there are only three
      users overall.
      
      A handful of stragglers were converted manually, notably a macro in
      drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
      instances in net/bluetooth/hci_sock.c. In the former file, I also
      had to fix one whitespace problem spatch introduced.
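      A similarly minimal sketch for skb_put() (struct demo_hdr and
      add_demo_hdr() are hypothetical):

          #include <linux/types.h>
          #include <linux/skbuff.h>

          struct demo_hdr {               /* hypothetical on-wire header */
                  u8 type;
                  u8 len;
          };

          /* Before: struct demo_hdr *h = (struct demo_hdr *)skb_put(skb, sizeof(*h));
           * After: skb_put() returns void *, so the cast disappears. */
          static void add_demo_hdr(struct sk_buff *skb)
          {
                  struct demo_hdr *h = skb_put(skb, sizeof(*h));

                  h->type = 1;
                  h->len = sizeof(*h);
          }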
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • networking: introduce and use skb_put_data() · 59ae1d12
      Johannes Berg authored
      A common pattern with skb_put() is to simply memcpy() some data
      into the newly added space; introduce skb_put_data() for this.
      
      An spatch similar to the one for skb_put_zero() converts many
      of the places using it:
      
          @@
          identifier p, p2;
          expression len, skb, data;
          type t, t2;
          @@
          (
          -p = skb_put(skb, len);
          +p = skb_put_data(skb, data, len);
          |
          -p = (t)skb_put(skb, len);
          +p = skb_put_data(skb, data, len);
          )
          (
          p2 = (t2)p;
          -memcpy(p2, data, len);
          |
          -memcpy(p, data, len);
          )
      
          @@
          type t, t2;
          identifier p, p2;
          expression skb, data;
          @@
          t *p;
          ...
          (
          -p = skb_put(skb, sizeof(t));
          +p = skb_put_data(skb, data, sizeof(t));
          |
          -p = (t *)skb_put(skb, sizeof(t));
          +p = skb_put_data(skb, data, sizeof(t));
          )
          (
          p2 = (t2)p;
          -memcpy(p2, data, sizeof(*p));
          |
          -memcpy(p, data, sizeof(*p));
          )
      
          @@
          expression skb, len, data;
          @@
          -memcpy(skb_put(skb, len), data, len);
          +skb_put_data(skb, data, len);
      
      (again, manually post-processed to retain some comments)
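      A minimal sketch of the replaced pattern (append_old() and
      append_new() are hypothetical wrappers):

          #include <linux/skbuff.h>
          #include <linux/string.h>

          /* Before: reserve tail room, then copy into it by hand. */
          static void append_old(struct sk_buff *skb, const void *data,
                                 unsigned int len)
          {
                  memcpy(skb_put(skb, len), data, len);
          }

          /* After: skb_put_data() does both steps in one call. */
          static void append_new(struct sk_buff *skb, const void *data,
                                 unsigned int len)
          {
                  skb_put_data(skb, data, len);
          }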
      Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • networking: convert many more places to skb_put_zero() · b080db58
      Johannes Berg authored
      There were many places that my previous spatch didn't find,
      as pointed out by yuan linyu in various patches.
      
      The following spatch found many more and also removes the
      now unnecessary casts:
      
          @@
          identifier p, p2;
          expression len;
          expression skb;
          type t, t2;
          @@
          (
          -p = skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          |
          -p = (t)skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, len);
          |
          -memset(p, 0, len);
          )
      
          @@
          type t, t2;
          identifier p, p2;
          expression skb;
          @@
          t *p;
          ...
          (
          -p = skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          |
          -p = (t *)skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, sizeof(*p));
          |
          -memset(p, 0, sizeof(*p));
          )
      
          @@
          expression skb, len;
          @@
          -memset(skb_put(skb, len), 0, len);
          +skb_put_zero(skb, len);
      
      Apply it to the tree (with one manual fixup to keep the
      comment in vxlan.c, which spatch removed.)
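      A minimal sketch of the replaced pattern (pad_old() and pad_new()
      are hypothetical wrappers):

          #include <linux/skbuff.h>
          #include <linux/string.h>

          /* Before: reserve tail room, then zero it explicitly. */
          static void pad_old(struct sk_buff *skb, unsigned int len)
          {
                  memset(skb_put(skb, len), 0, len);
          }

          /* After: skb_put_zero() reserves and zeroes in one call. */
          static void pad_new(struct sk_buff *skb, unsigned int len)
          {
                  skb_put_zero(skb, len);
          }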
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'r8152-adjust-runtime-suspend-resume' · 61f73d1e
      David S. Miller authored
      Hayes Wang says:
      
      ====================
      r8152: adjust runtime suspend/resume
      
      v2:
      For #1, replace GFP_KERNEL with GFP_NOIO for usb_submit_urb().
      
      v1:
      Improve the runtime suspend/resume flow and make the code easier
      to read.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: move calling delay_autosuspend function · bd882982
      hayeswang authored
      Move the call to delay_autosuspend() within rtl8152_runtime_suspend()
      so that it happens as late as possible.
      
      The original flows are
         1. check if the driver/device is busy now.
         2. set wake events.
         3. enter runtime suspend.
      
      If a wake event occurs between (1) and (2), the device may miss it.
      Besides, to avoid a runtime resume occurring immediately after the
      runtime suspend, move the check to the end of rtl8152_runtime_suspend().
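      A rough sketch of the reordered flow; apart from delay_autosuspend()
      and struct r8152, the names below are hypothetical stand-ins for the
      real driver helpers:

          #include <linux/types.h>
          #include <linux/errno.h>

          struct r8152;                              /* driver private context */
          bool device_busy(struct r8152 *tp);        /* hypothetical helper */
          void arm_wake_events(struct r8152 *tp);    /* hypothetical helper */
          bool delay_autosuspend(struct r8152 *tp);  /* r8152.c helper, simplified */
          void enter_suspend(struct r8152 *tp);      /* hypothetical helper */

          static int runtime_suspend_sketch(struct r8152 *tp)
          {
                  if (device_busy(tp))            /* 1. busy? don't suspend */
                          return -EBUSY;

                  arm_wake_events(tp);            /* 2. arm wake events first */

                  /* 3. re-check as late as possible, so a wake event that
                   * arrived in the meantime is not lost and we do not
                   * runtime-resume right after suspending. */
                  if (delay_autosuspend(tp))
                          return -EBUSY;

                  enter_suspend(tp);
                  return 0;
          }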
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: split rtl8152_resume function · 21cbd0ec
      hayeswang authored
      Split rtl8152_resume() into rtl8152_runtime_resume() and
      rtl8152_system_resume().
      
      Besides, replace GFP_KERNEL with GFP_NOIO for usb_submit_urb().
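      A minimal illustration of the allocation-flag change (the function
      and its argument are hypothetical; the point is only the GFP_NOIO
      flag, which avoids triggering new I/O from a resume path):

          #include <linux/usb.h>
          #include <linux/gfp.h>

          static int resubmit_rx_urb(struct urb *rx_urb)
          {
                  /* was usb_submit_urb(rx_urb, GFP_KERNEL) */
                  return usb_submit_urb(rx_urb, GFP_NOIO);
          }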
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Depend upon INET not plain NET. · 54144b48
      David S. Miller authored
      We refer to TCP et al. symbols, so we have to use INET as
      the dependency.
      
         ERROR: "tcp_prot" [net/tls/tls.ko] undefined!
      >> ERROR: "tcp_rate_check_app_limited" [net/tls/tls.ko] undefined!
         ERROR: "tcp_register_ulp" [net/tls/tls.ko] undefined!
         ERROR: "tcp_unregister_ulp" [net/tls/tls.ko] undefined!
         ERROR: "do_tcp_sendpages" [net/tls/tls.ko] undefined!
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlx4-XDP-performance-improvements' · 117b07e6
      David S. Miller authored
      Tariq Toukan says:
      
      ====================
      mlx4 XDP performance improvements
      
      This patchset contains data-path improvements, mainly for XDP_DROP
      and XDP_TX cases.
      
      Main patches:
      * Patch 2 by Saeed allows enabling optimized A0 RX steering (in HW) when
        setting a single RX ring.
        With this configuration, HW packet-rate dramatically improves,
        reaching 28.1 Mpps in XDP_DROP case for both IPv4 (37% gain)
        and IPv6 (53% gain).
      * Patch 6 enhances the XDP xmit function. Among other changes, now we
        ring one doorbell per NAPI. Patch gives 17% gain in XDP_TX case.
      * Patch 7 obsoletes the NAPI of XDP_TX completion queue and integrates its
        poll into the respective RX NAPI. Patch gives 15% gain in XDP_TX case.
      
      Series generated against net-next commit:
      f7aec129 rxrpc: Cache the congestion window setting
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Refactor mlx4_en_free_tx_desc · 4c07c132
      Tariq Toukan authored
      Some code re-ordering, functionally equivalent.
      
      - The !tx_info->inl check is evaluated anyway in both flows
        (common case/end case). Run it first; this might finish
        the flows earlier.
      - The dma_unmap calls are identical in both flows; move them
        out of the if block into the common area.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Gain is too small to be measurable, no degradation sensed.
      Results are similar for IPv4 and IPv6.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations · 9573e0d3
      Tariq Toukan authored
      Define LOG_TXBB_SIZE, the log2 of TXBB_SIZE, and use it in a shift
      operation instead of multiplying by TXBB_SIZE. The operations are
      equivalent because TXBB_SIZE is a power of two.
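      Schematically (the value 64 matches the mlx4 TXBB size, but the
      helpers below are illustrative, not the driver's exact code):

          #include <linux/types.h>

          #define LOG_TXBB_SIZE 6                        /* log2 of the TXBB size */
          #define TXBB_SIZE     (1 << LOG_TXBB_SIZE)     /* 64 bytes */

          /* Before: multiply by the TXBB size. */
          static inline u32 txbb_offset_mul(u32 index)
          {
                  return index * TXBB_SIZE;
          }

          /* After: shift by its log2; identical result for a power-of-two size. */
          static inline u32 txbb_offset_shl(u32 index)
          {
                  return index << LOG_TXBB_SIZE;
          }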
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Gain is too small to be measurable, no degradation sensed.
      Results are similar for IPv4 and IPv6.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Increase default TX ring size · 77788b5b
      Tariq Toukan authored
      Increase the default TX ring size (from 512 to 1024) to match
      the RX ring size.
      This gives the XDP TX ring a better chance to keep up with the
      rate of its RX ring in case of a high load of XDP_TX actions.
      
      Tested:
      The ethtool counter rx_xdp_tx_full used to increase; after applying
      this patch it no longer does.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Poll XDP TX completion queue in RX NAPI · 6c78511b
      Tariq Toukan authored
      Instead of having their own NAPIs, XDP TX completion queues get
      polled within the corresponding RX NAPI.
      This prevents any possible race on the TX ring prod/cons indices
      between the context that issues the transmits (the RX NAPI) and
      the context that handles the completions (previously a separate
      NAPI).
      
      This also improves performance, as it decreases the number
      of NAPIs running on a CPU, saving the overhead of syncing
      and switching between the contexts.
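      Schematically, the RX poll routine now reaps the XDP TX completions
      before doing its RX work; every demo_* name below is a hypothetical
      stand-in for the mlx4_en internals:

          #include <linux/netdevice.h>

          struct demo_ring {                      /* hypothetical per-ring context */
                  struct napi_struct napi;
          };

          static void demo_poll_xdp_tx_cq(struct demo_ring *ring)
          {
                  /* reap XDP TX completions here; running this from the RX NAPI
                   * means the TX prod/cons indices are touched from one context only */
          }

          static int demo_process_rx_cq(struct demo_ring *ring, int budget)
          {
                  /* process up to @budget RX completions, possibly issuing XDP_TX */
                  return 0;
          }

          static int demo_rx_napi_poll(struct napi_struct *napi, int budget)
          {
                  struct demo_ring *ring = container_of(napi, struct demo_ring, napi);
                  int done;

                  demo_poll_xdp_tx_cq(ring);               /* TX completions first */
                  done = demo_process_rx_cq(ring, budget); /* then RX work */

                  if (done < budget)
                          napi_complete_done(napi, done);
                  return done;
          }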
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 12.0 Mpps | 13.8 Mpps |  15% |
      IPv6 | 12.0 Mpps | 13.8 Mpps |  15% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve XDP xmit function · 36ea7964
      Tariq Toukan authored
      Several performance improvements in XDP TX datapath,
      including:
      - Ring a single doorbell for the XDP TX ring per NAPI budget,
        instead of ringing it each time a lower threshold (previously 8)
        was crossed; see the doorbell-batching sketch after this list.
        This includes removing the flow of immediate doorbell ringing
        in case of a full TX ring.
      - Compiler branch predictor hints.
      - Calculate values at compile time rather than at runtime.
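      A hypothetical sketch of the doorbell batching (the demo_* names and
      fields are illustrative, not the driver's code):

          #include <linux/io.h>
          #include <linux/types.h>

          struct demo_xdp_tx_ring {
                  u32 prod;                /* SW producer index */
                  u32 doorbell_pending;    /* descriptors queued since last doorbell */
                  void __iomem *db;        /* doorbell register */
          };

          static void demo_xdp_xmit_frame(struct demo_xdp_tx_ring *ring)
          {
                  /* ... build the TX descriptor at slot ring->prod ... */
                  ring->prod++;
                  ring->doorbell_pending++;       /* no doorbell write here */
          }

          /* Called once at the end of the NAPI poll. */
          static void demo_xdp_ring_doorbell(struct demo_xdp_tx_ring *ring)
          {
                  if (!ring->doorbell_pending)
                          return;

                  wmb();                          /* descriptors before doorbell */
                  writel(ring->prod, ring->db);   /* one MMIO write per NAPI budget */
                  ring->doorbell_pending = 0;
          }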
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.3 Mpps | 12.0 Mpps |  17% |
      IPv6 | 10.3 Mpps | 12.0 Mpps |  17% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve stack xmit function · f28186d6
      Tariq Toukan authored
      Several small code and performance improvements in stack TX datapath,
      including:
      - Compiler branch predictor hints.
      - Minimize variables scope.
      - Move tx_info non-inline flow handling to a separate function.
      - Calculate data_offset at compile time rather than at runtime
        (for the !lso_header_size branch).
      - Avoid the ternary operator ("?") when the value can be preset in
        the matching branch; see the sketch after this list.
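      A generic illustration of the branch-hint and ternary items (not
      mlx4 code; the demo_* names are made up):

          #include <linux/compiler.h>
          #include <linux/types.h>

          static int demo_slow_path(void)         /* hypothetical rare case */
          {
                  return 1;
          }

          static int demo_handle(bool fast_path)
          {
                  /* Instead of "ret = fast_path ? 0 : demo_slow_path();",
                   * preset the value for the common case ... */
                  int ret = 0;

                  /* ... and tell the compiler which branch is the rare one. */
                  if (unlikely(!fast_path))
                          ret = demo_slow_path();

                  return ret;
          }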
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Gain is too small to be measurable, no degradation sensed.
      Results are similar for IPv4 and IPv6.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve transmit CQ polling · cc26a490
      Tariq Toukan authored
      Several small performance improvements in TX CQ polling,
      including:
      - Compiler branch predictor hints.
      - Minimize variables scope.
      - A more precise check of the cq type.
      - Use a boolean instead of an int for a binary indication.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Packet-rate tests for both regular stack and XDP use cases:
      No noticeable gain, no degradation.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve receive data-path · 9bcee89a
      Tariq Toukan authored
      Several small performance improvements in RX datapath,
      including:
      - Compiler branch predictor hints.
      - Replace a multiplication with a shift operation.
      - Minimize variables scope.
      - Write-prefetch for the packet header; see the sketch after this
        list.
      - Avoid the ternary operator ("?") when the value can be preset in
        the matching branch.
      - Save a branch by updating RX ring doorbell within
        mlx4_en_refill_rx_buffers(), which now returns void.
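      A minimal illustration of the write-prefetch item (demo_rx_prefetch()
      is a hypothetical, heavily simplified stand-in):

          #include <linux/prefetch.h>
          #include <linux/skbuff.h>

          static void demo_rx_prefetch(struct sk_buff *skb)
          {
                  /* pull the header cache line in for write before the packet
                   * headers are read and modified further down the path */
                  prefetchw(skb->data);

                  /* ... run XDP / deliver the packet ... */
          }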
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON
      (enable by ethtool -L <interface> rx 1).
      
      XDP_DROP packet rate:
      Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%).
      
      Drop packets in TC:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 4.14 Mpps | 4.18 Mpps |   1% |
      -------------------------------------
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.1 Mpps | 10.3 Mpps |   2% |
      IPv6 | 10.1 Mpps | 10.3 Mpps |   2% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Optimized single ring steering · 4931c6ef
      Saeed Mahameed authored
      Avoid touching RX QP RSS context when loading with only
      one RX ring, to allow optimized A0 RX steering.
      
      Enable by:
      - loading mlx4_core with module param: log_num_mgm_entry_size = -6.
      - then: ethtool -L <interface> rx 1
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      XDP_DROP packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 20.5 Mpps | 28.1 Mpps |  37% |
      IPv6 | 18.4 Mpps | 28.1 Mpps |  53% |
      -------------------------------------
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Remove unused argument in TX datapath function · cf97050d
      Tariq Toukan authored
      Remove the owner argument, as it is obsolete and unused.
      This also saves the overhead of calculating its value in the data path.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Jun, 2017 21 commits