  3. 03 Feb, 2021 1 commit
      Merge branch 'net-use-indirect_call-in-some-dst_ops' · 2d912da0
      Jakub Kicinski authored
      Brian Vazquez says:
      
      ====================
      net: use INDIRECT_CALL in some dst_ops
      
      This patch series uses the INDIRECT_CALL wrappers in some dst_ops
      functions to mitigate retpoline costs. Benefits depend on the
      platform as described below.
      
      Background: The kernel rewrites the retpoline code at
      __x86_indirect_thunk_r11 depending on the CPU's requirements.
      The INDIRECT_CALL wrappers provide hints on possible targets and
      save the retpoline overhead using a direct call in case the
      target matches one of the hints.
      
      The retpoline overhead for the following three cases has been
      measured by Luigi Rizzo in microbenchmarks, using CPU performance
      counters. These cases cover reasonably well the range of possible
      retpoline overheads compared to a plain indirect call (in equal
      conditions, specifically with predicted branch, hot cache):
      
      - just "jmp *(%r11)" on modern platforms like Intel Cascadelake.
        In this case the overhead is just 2 clock cycles.
      
      - "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs.
        In this case the lfence is blocked until pending reads complete,
        so the actual overhead depends on previous instructions.
        In the best case we have measured 15 clock cycles of overhead.
      
      - worst case, e.g. Skylake, the full retpoline is used:
      
          __x86_indirect_thunk_r11:     call set_up_target
          capture_speculation:          pause
                                        lfence
                                        jmp capture_speculation
          .align 16
          set_up_target:                mov %r11, (%rsp)
                                        ret
      
         In this case the overhead has been measured at 35-40 clock cycles.
      
      The actual time saved hence depends on the platform and current
      clock speed (which varies heavily, especially when C-states are active).
      Also note that actual benefit might be lower than expected if the
      longer retpoline overlaps with some pending memory read.
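
To make the dependence on clock speed concrete, here is a minimal sketch that converts the cycle counts quoted above into nanoseconds. The clock frequencies (2.0 and 3.0 GHz) are assumptions chosen purely for illustration; the text does not state the clock speed of the test machines.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds at the given clock (in GHz)."""
    return cycles / clock_ghz


# Overheads quoted in the text, in clock cycles.
OVERHEADS = [
    ("plain jmp", 2),
    ("lfence; jmp (best case)", 15),
    ("full retpoline (low end)", 35),
    ("full retpoline (high end)", 40),
]

# Hypothetical clock speeds, not taken from the text.
ASSUMED_CLOCKS_GHZ = (2.0, 3.0)

for label, cycles in OVERHEADS:
    for ghz in ASSUMED_CLOCKS_GHZ:
        ns = cycles_to_ns(cycles, ghz)
        print(f"{label:28s} {cycles:3d} cycles @ {ghz} GHz = {ns:5.2f} ns")
```

Under these assumed clocks the full retpoline costs somewhere between roughly 12 and 20 ns per call, while the plain jmp costs about 1 ns or less.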
      
      MEASUREMENTS:
      The INDIRECT_CALL wrappers in this patchset involve the processing
      of incoming SYN and generation of syncookies. Hence, the test has been
      run by configuring a receiving host with a single NIC rx queue, disabling
      RPS and RFS so that all processing occurs on the same core.
      An external source generates SYN fast enough to saturate the receiving CPU.
      We ran two sets of experiments, with and without the dst_output patch,
      comparing the number of syncookies generated over a 20s period
      in multiple runs.
      
      Assuming the CPU is saturated, the time per packet is
         t = total_time/number_of_packets
      and if the two datasets have a statistically meaningful difference,
      the difference in times between the two cases gives an estimate
      of the benefits from one INDIRECT_CALL.
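
As a rough illustration of this estimate, the sketch below takes the time per packet as the run duration divided by the packet count and compares the first Skylake run of each configuration from the results table. A single pair of runs is not statistically meaningful on its own; the function name and variable names are illustrative only.

```python
DURATION_S = 20.0  # each test run lasted 20 s (from the text)


def time_per_packet(packets, duration_s=DURATION_S):
    """Seconds spent per packet, assuming the CPU is saturated."""
    return duration_s / packets


# First Skylake run of each configuration, copied from the results table.
indirect = time_per_packet(9166325)
retpoline = time_per_packet(9099308)

# Per-packet saving attributed to one INDIRECT_CALL, in nanoseconds.
saving_ns = (retpoline - indirect) * 1e9

print(f"indirect:  {indirect:.6e} s/packet")
print(f"retpoline: {retpoline:.6e} s/packet")
print(f"saving:    {saving_ns:.1f} ns/packet")
```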
      
      Here are the experimental results:
      
      Skylake     Syncookies over 20s (5 tests)
      ---------------------------------------------------
      indirect    9166325 9182023 9170093 9134014 9171082
      retpoline   9099308 9126350 9154841 9056377 9122376
      
      Computing the stats on the per-packet time in seconds (20/total_packets) gives the following:
      
      $ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp
      x /tmp/sk-indirect
      + /tmp/sk-retp
      +----------------------------------------------------------------------+
      |x     xx x     +          x    + +           +                       +|
      ||______M__A_______|_|____________M_____A___________________|          |
      +----------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x   5   2.17817e-06   2.18962e-06     2.181e-06  2.182292e-06 4.3252133e-09
      +   5   2.18464e-06   2.20839e-06   2.19241e-06  2.194974e-06 8.8695958e-09
      Difference at 95.0% confidence
              1.2682e-08 +/- 1.01766e-08
              0.581132% +/- 0.466326%
              (Student's t, pooled s = 6.97772e-09)
      
      This suggests a difference of 13ns +/- 10ns.
      Our expectation from microbenchmarks was 35-40 cycles per call,
      but part of the gains may be eaten by stalls from pending memory reads.
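
The Skylake ministat figures can be cross-checked with a short script. This is a sketch, assuming per-packet times in seconds (20 s divided by each count) and a pooled-variance Student's t interval, which is what the ministat output reports. The critical value 2.306 (two-sided 95%, 8 degrees of freedom) is taken from a standard t table and hardcoded to avoid extra dependencies; the helper names are illustrative.

```python
import math
import statistics

DURATION_S = 20.0  # length of each test run, from the text

# Syncookie counts per run, copied from the Skylake table.
indirect = [9166325, 9182023, 9170093, 9134014, 9171082]
retpoline = [9099308, 9126350, 9154841, 9056377, 9122376]

# Two-sided 95% Student's t critical value for 5 + 5 - 2 = 8 degrees of
# freedom (assumption: value from a standard t table).
T_CRIT_95_DF8 = 2.306


def per_packet_times(counts):
    """Seconds per packet for each run."""
    return [DURATION_S / c for c in counts]


def pooled_difference(a, b, t_crit):
    """Return (difference of means, CI half-width, pooled stddev)."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    half_width = t_crit * pooled * math.sqrt(1 / na + 1 / nb)
    return statistics.mean(b) - statistics.mean(a), half_width, pooled


diff, half_width, pooled = pooled_difference(
    per_packet_times(indirect), per_packet_times(retpoline), T_CRIT_95_DF8
)
print(f"difference: {diff * 1e9:.1f} ns +/- {half_width * 1e9:.1f} ns")
print(f"pooled s:   {pooled:.5e} s")
```

The printed difference and pooled standard deviation should agree with the ministat lines "Difference at 95.0% confidence" and "pooled s" above, up to rounding.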
      
      For Cascadelake:
      Cascadelake     Syncookies over 20s (5 tests)
      ---------------------------------------------------------
      indirect     10339797 10297547 10366826 10378891 10384854
      retpoline    10332674 10366805 10320374 10334272 10374087
      
      Computing the stats on the per-packet time in seconds (20/total_packets) gives no
      meaningful difference even at just 80% confidence (this was expected):
      
      $ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp
      x /tmp/cl-indirect
      + /tmp/cl-retp
      +----------------------------------------------------------------------+
      |   x    x  +     *                   x   + +        +                x|
      ||______________|_M_________A_____A_______M________|___|               |
      +----------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x   5   1.92588e-06   1.94221e-06   1.92923e-06  1.931716e-06 6.6936746e-09
      +   5   1.92788e-06   1.93791e-06   1.93531e-06  1.933188e-06 4.3734106e-09
      No difference proven at 80.0% confidence
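
The "no difference" verdict for Cascadelake can be sketched as a pooled two-sample t-test on the per-packet times. This is an illustrative check under the same assumptions as the ministat output (equal-variance Student's t); the critical value 1.397 (two-sided 80%, 8 degrees of freedom) comes from a standard t table, and the function name is hypothetical.

```python
import math
import statistics

DURATION_S = 20.0  # length of each test run, from the text

# Syncookie counts per run, copied from the Cascadelake table.
indirect = [10339797, 10297547, 10366826, 10378891, 10384854]
retpoline = [10332674, 10366805, 10320374, 10334272, 10374087]

# Two-sided 80% Student's t critical value for 8 degrees of freedom
# (assumption: value from a standard t table).
T_CRIT_80_DF8 = 1.397


def pooled_t_statistic(a, b):
    """Two-sample t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / std_err


times_indirect = [DURATION_S / c for c in indirect]
times_retpoline = [DURATION_S / c for c in retpoline]

t_stat = pooled_t_statistic(times_indirect, times_retpoline)
significant = abs(t_stat) > T_CRIT_80_DF8

print(f"t statistic: {t_stat:.3f}, critical value: {T_CRIT_80_DF8}")
print("difference proven" if significant else "no difference proven")
```

The t statistic falls well below the critical value, which is consistent with ministat reporting no difference even at 80% confidence.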
      ====================
      
      Link: https://lore.kernel.org/r/20210201174132.3534118-1-brianvv@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>