    x86/clear_user: Make it faster · 0db7058e
    Borislav Petkov authored
    Based on a patch by Mark Hemment <markhemm@googlemail.com> and
    incorporating very sane suggestions from Linus.
    
    The point here is to have the default case with FSRM - which is supposed
    to be the majority of x86 hw out there - if not now then soon - be
    directly inlined into the instruction stream so that no function call
    overhead is taking place.
    
    Drop the early clobbers from the @size and @addr operands as those are
    not needed anymore since we have single instruction alternatives.
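
    As a rough userspace illustration of the inlined single instruction
    case described above, a minimal sketch could look like the following.
    The name clear_bytes_sketch() is hypothetical, and the kernel's
    exception handling and runtime patching of alternatives are
    deliberately left out, so this is not the code from the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace illustration only: zero `size` bytes at `addr` and return the
 * number of bytes left uncleared (always 0 here, because this sketch has
 * no fault handling). clear_bytes_sketch() is a hypothetical name.
 */
static inline size_t clear_bytes_sketch(void *addr, size_t size)
{
#if defined(__x86_64__)
	/*
	 * "+c" and "+D" are plain read-write operands. With a single
	 * instruction, no input is read after an output has been written,
	 * so no early-clobber ("&") modifier is needed.
	 */
	__asm__ __volatile__("rep stosb"
			     : "+c" (size), "+D" (addr)
			     : "a" (0)
			     : "memory");
#else
	/* Portable fallback for non-x86 builds. */
	memset(addr, 0, size);
	size = 0;
#endif
	return size;
}
```

    On x86-64, REP STOSB leaves the count register at zero once it has
    completed, which is why the sketch can return @size directly.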
    
    The benchmarks I ran would show very small improvements and a PF
    benchmark would even show weird things like slowdowns with higher core
    counts.
    
    So during a ~6m run of the git test suite, the function gets called
    under 700K times, all from padzero():
    
      <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
      <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
      <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
      <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
       ...
    
    which is around 1%-ish of the total time and which is consistent with
    the benchmark numbers.
    
    So Mel gave me the idea to simply measure how fast the function becomes.
    I.e.:
    
      start = rdtsc_ordered();
      ret = __clear_user(to, n);
      end = rdtsc_ordered();
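
    A userspace approximation of that measurement could be sketched as
    below. It assumes a fenced __rdtsc() as a stand-in for the kernel's
    rdtsc_ordered() and memset() as a stand-in for __clear_user(). The
    names sample_stats, timestamp_sketch() and timed_clear_sketch() are
    hypothetical and only serve the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

/* Running totals, mirroring the "Sum" and "samples" figures. */
struct sample_stats {
	uint64_t sum;
	uint64_t samples;
};

/* Read a timestamp: fenced TSC on x86-64, nanoseconds elsewhere. */
static inline uint64_t timestamp_sketch(void)
{
#if defined(__x86_64__)
	uint64_t t;

	_mm_lfence();		/* keep earlier instructions before the read */
	t = __rdtsc();
	_mm_lfence();		/* keep later instructions after the read */
	return t;
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Time one clearing operation and add it to the running totals. */
static void timed_clear_sketch(struct sample_stats *st, void *to, size_t n)
{
	uint64_t start, end;

	start = timestamp_sketch();
	memset(to, 0, n);	/* stand-in for __clear_user(to, n) */
	end = timestamp_sketch();

	st->sum += end - start;
	st->samples++;
}
```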
    
    Computing the arithmetic mean of all the samples collected during the
    test suite run then shows some improvement:
    
      clear_user_original:
      Amean: 9219.71 (Sum: 6340154910, samples: 687674)
    
      fsrm:
      Amean: 8030.63 (Sum: 5522277720, samples: 687652)
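
    The Amean values are simply Sum divided by samples. A small sketch
    of that arithmetic, with hypothetical helper names, is shown here.
    Fed with the two Sum/samples pairs above, it reproduces the reported
    means and puts the fsrm variant at roughly 13% fewer cycles per call.

```c
#include <stdint.h>

/* Arithmetic mean of the collected cycle counts. */
static double amean_sketch(uint64_t sum, uint64_t samples)
{
	return samples ? (double)sum / (double)samples : 0.0;
}

/* Relative change from `before` to `after`; negative means fewer cycles. */
static double relative_change_sketch(double before, double after)
{
	return (after - before) / before;
}
```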
    
    That's on Zen3.
    
    The situation looks a lot more confusing on Intel:
    
    Icelake:
    
      clear_user_original:
      Amean: 19679.4 (Sum: 13652560764, samples: 693750)
      Amean: 19743.7 (Sum: 13693470604, samples: 693562)
    
    (I ran it twice just to be sure.)
    
      ERMS:
      Amean: 20374.3 (Sum: 13910601024, samples: 682752)
      Amean: 20453.7 (Sum: 14186223606, samples: 693576)
    
      FSRM:
      Amean: 20458.2 (Sum: 13918381386, samples: 680331)
    
    The original microbenchmark which people were complaining about:
    
      for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
      32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
      37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
      62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
      60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
      53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
      55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
      55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
      54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
      50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
      58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
      68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s
    
    Now the same thing with smaller buffers:
    
      for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
      8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s
    
    is also not conclusive because it all depends on the buffer sizes,
    their alignments and when the microcode detects that cachelines can be
    aggregated properly and copied in bigger sizes.
    
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
uaccess_64.h 3.45 KB