1. 19 Apr, 2023 9 commits
    • x86: set FSRS automatically on AMD CPUs that have FSRM · e046fe5a
      Linus Torvalds authored
      So Intel introduced the FSRS ("Fast Short REP STOS") CPU capability bit,
      because they seem to have done the (much simpler) REP STOS optimizations
      separately and later than the REP MOVS one.
      
      In contrast, when AMD introduced support for FSRM ("Fast Short REP
      MOVS"), in the Zen 3 core, it appears to have improved the REP STOS case
      at the same time, and since the FSRS bit was added by Intel later, it
      doesn't show up on those AMD Zen 3 cores.
      
      And now that we made use of FSRS for the "rep stos" conditional, that
      made those AMD machines unnecessarily slower.  The Intel situation where
      "rep movs" is fast, but "rep stos" isn't, is just odd.  The 'stos' case
      is a lot simpler with no aliasing, no mutual alignment issues, no
      complicated cases.
      
      So this just sets FSRS automatically when FSRM is available on AMD
      machines, to get back all the nice REP STOS goodness in Zen 3.
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
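The fixup described in this commit message can be pictured with a small userspace model. This is only a sketch: the `cpu_info` struct, the `FEATURE_*` masks, and `apply_amd_fsrs_fixup()` are hypothetical stand-ins invented for illustration, not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's vendor IDs and feature bits. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD };

#define FEATURE_FSRM (UINT32_C(1) << 0) /* Fast Short REP MOVS */
#define FEATURE_FSRS (UINT32_C(1) << 1) /* Fast Short REP STOS */

struct cpu_info {
    enum cpu_vendor vendor;
    uint32_t features;
};

static bool cpu_has(const struct cpu_info *c, uint32_t bit)
{
    return (c->features & bit) != 0;
}

/* On AMD, treat FSRM as implying FSRS; other vendors are left untouched. */
static void apply_amd_fsrs_fixup(struct cpu_info *c)
{
    if (c->vendor == VENDOR_AMD && cpu_has(c, FEATURE_FSRM))
        c->features |= FEATURE_FSRS;
}
```

In this model an AMD CPU that reports only FSRM ends up with both bits set, while an Intel CPU with FSRM but no FSRS is not changed.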
    • x86: improve on the non-rep 'copy_user' function · 427fda2c
      Linus Torvalds authored
      The old 'copy_user_generic_unrolled' function was oddly implemented for
      largely historical reasons: it had been largely based on the uncached
      copy case, which has some other concerns.
      
      For example, the __copy_user_nocache() function uses 'movnti' for the
      destination stores, and those want the destination to be aligned.  In
      contrast, the regular copy function doesn't really care, and trying to
      align things only complicates matters.
      
      Also, like the clear_user function, the copy function had some odd
      handling of the repeat counts, complicating the exception handling for
      no really good reason.  So as with clear_user, just write it to keep all
      the byte counts in the %rcx register, exactly like the 'rep movs'
      functionality that this replaces.
      
      Unlike a real 'rep movs', we do allow for this to trash a few temporary
      registers to not have to unnecessarily save/restore registers on the
      stack.
      
      And like the clearing case, rename this to what it now clearly is:
      'rep_movs_alternative', and make it one coherent function, so that it
      shows up as such in profiles (instead of the odd split between
      "copy_user_generic_unrolled" and "copy_user_short_string", the latter of
      which was not about strings at all, and which was shared with the
      uncached case).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
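The "keep all the byte counts in one register" idea from this commit message can be modeled in plain C. The sketch below is a userspace illustration, not the kernel's assembly: `copy_alternative_model()` is a hypothetical name, the single `count` variable stands in for %rcx, and the 64-byte and 8-byte step sizes are assumptions chosen for the example. No alignment of the destination is attempted, in line with the message's point that the regular copy does not need it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace model of a non-REP copy. The remaining byte count lives in
 * one variable (standing in for %rcx) and is returned to the caller.
 * Nothing can fault here, so the return value is always 0.
 */
static size_t copy_alternative_model(unsigned char *dst,
                                     const unsigned char *src,
                                     size_t count)
{
    /* Large step: 64 bytes (eight 8-byte words) per iteration. */
    while (count >= 64) {
        for (int i = 0; i < 8; i++) {
            uint64_t w;
            memcpy(&w, src + 8 * i, sizeof w); /* unaligned-safe load */
            memcpy(dst + 8 * i, &w, sizeof w); /* unaligned-safe store */
        }
        src += 64;
        dst += 64;
        count -= 64;
    }

    /* Medium step: one 8-byte word at a time. */
    while (count >= 8) {
        uint64_t w;
        memcpy(&w, src, sizeof w);
        memcpy(dst, &w, sizeof w);
        src += 8;
        dst += 8;
        count -= 8;
    }

    /* Tail: remaining bytes one at a time. */
    while (count > 0) {
        *dst++ = *src++;
        count--;
    }

    return count;
}
```

Because every loop updates the same `count`, the number of bytes still outstanding is known at any point, which is the property the message credits with simplifying the exception handling.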
    • x86: improve on the non-rep 'clear_user' function · 8c9b6a88
      Linus Torvalds authored
      The old version was oddly written to have the repeat count in multiple
      registers.  So instead of taking advantage of %rax being zero, it had
      some sub-counts in it.  All just for a "single word clearing" loop,
      which isn't even efficient to begin with.
      
      So get rid of those games, and just keep all the state in the same
      registers we got it in (and that we should return things in).  That not
      only makes this act much more like 'rep stos' (which this function is
      replacing), but makes it much easier to actually do the obvious loop
      unrolling.
      
      Also rename the function from the now nonsensical 'clear_user_original'
      to what it now clearly is: 'rep_stos_alternative'.
      
      End result: if we don't have a fast 'rep stosb', at least we can have a
      fast fallback for it.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
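A rough userspace model of a clearing fallback that behaves like 'rep stos' is sketched below. Everything here is an assumption made for illustration: `clear_model()` is a hypothetical name, and the `writable` parameter simulates a fault by marking how many bytes can be written before the store would fail. The return value is the number of bytes left uncleared, mirroring what 'rep stos' would leave in its count register.

```c
#include <stddef.h>
#include <string.h>

/*
 * Clear up to 'count' bytes at 'dst'. Only the first 'writable' bytes
 * can be stored to; a store beyond that is treated as a simulated fault.
 * Returns the number of bytes that were NOT cleared.
 */
static size_t clear_model(unsigned char *dst, size_t count, size_t writable)
{
    size_t done = 0;

    /* Large step: 64 bytes at a time while the whole chunk is writable. */
    while (count >= 64 && done + 64 <= writable) {
        memset(dst + done, 0, 64);
        done += 64;
        count -= 64;
    }

    /* Medium step: 8 bytes at a time. */
    while (count >= 8 && done + 8 <= writable) {
        memset(dst + done, 0, 8);
        done += 8;
        count -= 8;
    }

    /* Tail: byte by byte, stopping exactly at the simulated fault. */
    while (count > 0 && done < writable) {
        dst[done++] = 0;
        count--;
    }

    return count;
}
```

For a 100-byte request where only 70 bytes are writable, this model clears bytes 0 through 69 and returns 30.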
    • x86: inline the 'rep movs' in user copies for the FSRM case · 577e6a7f
      Linus Torvalds authored
      This does the same thing for the user copies as commit 0db7058e
      ("x86/clear_user: Make it faster") did for clear_user().  In other
      words, it inlines the "rep movs" case when X86_FEATURE_FSRM is set,
      avoiding the function call entirely.
      
      In order to do that, it makes the calling convention for the out-of-line
      case ("copy_user_generic_unrolled") match the 'rep movs' calling
      convention, although it does also end up clobbering a number of
      additional registers.
      
      Also, to simplify code sharing in the low-level assembly with the
      __copy_user_nocache() function (that uses the normal C calling
      convention), we end up with a kind of mixed return value for the
      low-level asm code: it will return the result in both %rcx (to work as
      an alternative for the 'rep movs' case), _and_ in %rax (for the nocache
      case).
      
      We could avoid this by wrapping __copy_user_nocache() callers in an
      inline asm, but since the cost is just an extra register copy, it's
      probably not worth it.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
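The shape of "inline 'rep movs' when FSRM is set, otherwise call an out-of-line routine with the same convention" can be sketched in userspace C with GCC/Clang inline assembly. This is an illustration under stated assumptions: `copy_model()` and `copy_out_of_line()` are hypothetical names, the `have_fsrm` flag is a runtime stand-in for the kernel's feature-based selection, and the inline-asm path is only compiled on x86 with a GNU-compatible compiler. Both paths return the remaining byte count, the value 'rep movs' leaves in the count register.

```c
#include <stdbool.h>
#include <stddef.h>

/* Out-of-line fallback: same contract as 'rep movsb', returns bytes left. */
static size_t copy_out_of_line(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return len; /* always 0 here: nothing can fault in this model */
}

static inline size_t copy_model(void *dst, const void *src, size_t len,
                                bool have_fsrm)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (have_fsrm) {
        /* Destination, source and count are tied to the DI, SI and CX
         * registers; 'rep movsb' updates all three as it copies. */
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(len)
                         :
                         : "memory");
        return len; /* count register after the copy */
    }
#else
    (void)have_fsrm; /* no inline path on this target */
#endif
    return copy_out_of_line(dst, src, len);
}
```

Since the two paths share one contract, a caller does not need to know which one ran, which is the point of matching the calling conventions.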
    • x86: move stac/clac from user copy routines into callers · 3639a535
      Linus Torvalds authored
      This is preparatory work for inlining the 'rep movs' case, but also a
      cleanup.  The __copy_user_nocache() function was mis-used by the rdma
      code to do uncached kernel copies that don't actually want user copies
      at all, and as a result doesn't want the stac/clac either.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for user memory clearing · d2c95f9d
      Linus Torvalds authored
      The modern target to use is FSRS (Fast Short REP STOS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      clearing).
      
      Note! This changes the conditional for the inlining from FSRM ("fast
      short rep movs") to FSRS ("fast short rep stos").
      
      We'll have a separate fixup for AMD microarchitectures that have a good
      'rep stosb' yet do not set the new Intel-specific FSRS bit (because FSRM
      was there first).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for user memory copies · adfcf423
      Linus Torvalds authored
      The modern target to use is FSRM (Fast Short REP MOVS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      clearing).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for small memory clearing · 20f3337d
      Linus Torvalds authored
      The modern target to use is FSRS (Fast Short REP STOS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      clearing).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for small memory copies · 68674f94
      Linus Torvalds authored
      The modern target to use is FSRM (Fast Short REP MOVS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      copying and clearing).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 16 Apr, 2023 12 commits
  3. 15 Apr, 2023 6 commits
  4. 14 Apr, 2023 13 commits