1. 21 Apr, 2023 1 commit
    • x86: rewrite '__copy_user_nocache' function · 034ff37d
      Linus Torvalds authored
      I didn't really want to do this, but as part of all the other changes to
      the user copy loops, I've been looking at this horror.
      
      I tried to clean it up multiple times, but every time I just found more
      problems, and the way it's written, it's just too hard to fix them.
      
      For example, the code is written to do quad-word alignment, and will use
      regular byte accesses to get to that point.  That's fairly simple, but
      it means that any initial 8-byte alignment will be done with cached
      copies.
      
      However, the code then is very careful to do any 4-byte _tail_ accesses
      using an uncached 4-byte write, and that was claimed to be relevant in
      commit a82eee74 ("x86/uaccess/64: Handle the caching of 4-byte
      nocache copies properly in __copy_user_nocache()").
      
      So if you do a 4-byte copy using that function, it carefully uses a
      4-byte 'movnti' for the destination.  But if you were to do a 12-byte
      copy that is 4-byte aligned, it would _not_ do a 4-byte 'movnti'
      followed by an 8-byte 'movnti' to keep it all uncached.
      
      Instead, it would align the destination to 8 bytes using a
      byte-at-a-time loop, and then do an 8-byte 'movnti' for the final 8
      bytes.
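      
      As a rough userspace illustration (not the kernel code; the helper name
      and the use of SSE2 intrinsics are purely hypothetical), a 12-byte copy
      to a 4-byte-aligned but not 8-byte-aligned destination can stay entirely
      uncached like this:
      
        #include <immintrin.h>
        #include <stdint.h>
        #include <string.h>
        
        /* Hypothetical sketch: 12-byte copy, destination 4-byte aligned but
         * not 8-byte aligned, done entirely with non-temporal stores. */
        static void nocache_copy12(void *dst, const void *src)
        {
                uint32_t head;
                uint64_t tail;
        
                memcpy(&head, src, sizeof(head));
                memcpy(&tail, (const char *)src + 4, sizeof(tail));
        
                /* 4-byte 'movnti' brings the destination up to 8-byte alignment */
                _mm_stream_si32((int *)dst, (int)head);
                /* ... and a single 8-byte 'movnti' finishes the copy uncached */
                _mm_stream_si64((long long *)((char *)dst + 4), (long long)tail);
                /* order the non-temporal stores before later ordinary stores */
                _mm_sfence();
        }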
      
      The main caller that cares is __copy_user_flushcache(), which knows
      about this insanity, and has odd cases for it all.  But I just can't
      deal with looking at this kind of "it does one case right, and another
      related case entirely wrong".
      
      And the code really wasn't fixable without hard drugs, which I try to
      avoid.
      
      So instead, rewrite it in a form that hopefully not only gets this
      right, but is a bit more maintainable.  Knock wood.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 20 Apr, 2023 1 commit
    • x86: remove 'zerorest' argument from __copy_user_nocache() · e1f2750e
      Linus Torvalds authored
      Every caller passes in zero, meaning they don't want any partial copy to
      zero the remainder of the destination buffer.
      
      Which is just as well, because the implementation of that function
      didn't actually even look at that argument, and wasn't even aware it
      existed, although some misleading comments still mentioned it.
      
      The 'zerorest' thing is a historical artifact of how "copy_from_user()"
      worked, in that it would zero the rest of the kernel buffer that it
      copied into.
      
      That zeroing still exists, but it's long since been moved to generic
      code, and the raw architecture-specific code doesn't do it.  See
      _copy_from_user() in lib/usercopy.c for this all.
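      
      The shape of that generic-code tail zeroing, as a simplified sketch (not
      the literal lib/usercopy.c source; raw_copy() here is just a stand-in
      for the raw architecture-specific copy, which returns the number of
      bytes it did NOT copy):
      
        #include <stddef.h>
        #include <string.h>
        
        /* Stand-in for the raw architecture-specific copy routine. */
        static size_t raw_copy(void *to, const void *from, size_t n)
        {
                memcpy(to, from, n);            /* pretend nothing faulted */
                return 0;                       /* bytes NOT copied */
        }
        
        /* Simplified copy_from_user()-like wrapper: if the raw copy came up
         * short, zero the uncopied tail of the kernel buffer. */
        static size_t copy_from_user_like(void *to, const void *from, size_t n)
        {
                size_t left = raw_copy(to, from, n);
        
                if (left)
                        memset((char *)to + (n - left), 0, left);
                return left;                    /* bytes not copied */
        }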
      
      However, while __copy_user_nocache() shares some history and other
      superficial similarities with copy_from_user(), it is in many ways also
      very different.
      
      In particular, while the code makes it *look* similar to the generic
      user copy functions that can copy both to and from user space, and take
      faults on both reads and writes as a result, __copy_user_nocache() does
      no such thing at all.
      
      __copy_user_nocache() always copies to kernel space, and will never take
      a page fault on the destination.  What *can* happen, though, is that the
      non-temporal stores take a machine check because one of the use cases is
      for writing to stable memory, and any memory errors would then take
      synchronous faults.
      
      So __copy_user_nocache() does look a lot like copy_from_user(), but has
      faulting behavior that is more akin to our old copy_in_user() (which no
      longer exists, but copied from user space to user space and could fault
      on both source and destination).
      
      And it very much does not have the "zero the end of the destination
      buffer", since a problem with the destination buffer is very possibly
      the very source of the partial copy.
      
      So this whole thing was just a confusing historical artifact from having
      shared some code with a completely different function with completely
      different use cases.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 19 Apr, 2023 9 commits
    • x86: set FSRS automatically on AMD CPUs that have FSRM · e046fe5a
      Linus Torvalds authored
      So Intel introduced the FSRS ("Fast Short REP STOS") CPU capability bit,
      because they seem to have done the (much simpler) REP STOS optimizations
      separately and later than the REP MOVS one.
      
      In contrast, when AMD introduced support for FSRM ("Fast Short REP
      MOVS"), in the Zen 3 core, it appears to have improved the REP STOS case
      at the same time, and since the FSRS bit was added by Intel later, it
      doesn't show up on those AMD Zen 3 cores.
      
      And now that we made use of FSRS for the "rep stos" conditional, that
      made those AMD machines unnecessarily slower.  The Intel situation where
      "rep movs" is fast, but "rep stos" isn't, is just odd.  The 'stos' case
      is a lot simpler with no aliasing, no mutual alignment issues, no
      complicated cases.
      
      So this just sets FSRS automatically when FSRM is available on AMD
      machines, to get back all the nice REP STOS goodness in Zen 3.
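      
      The change amounts to something along these lines in the AMD CPU setup
      path (a sketch of the idea, not necessarily the literal patch; cpu_has()
      and set_cpu_cap() are the usual x86 cpufeature helpers):
      
        /* In the AMD-specific CPU init: Zen 3's FSRM implies a fast short
         * 'rep stosb' as well, so treat FSRM as implying FSRS. */
        if (cpu_has(c, X86_FEATURE_FSRM))
                set_cpu_cap(c, X86_FEATURE_FSRS);
      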
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: improve on the non-rep 'copy_user' function · 427fda2c
      Linus Torvalds authored
      The old 'copy_user_generic_unrolled' function was oddly implemented for
      largely historical reasons: it had been based on the uncached copy case,
      which has some other concerns.
      
      For example, the __copy_user_nocache() function uses 'movnti' for the
      destination stores, and those want the destination to be aligned.  In
      contrast, the regular copy function doesn't really care, and trying to
      align things only complicates matters.
      
      Also, like the clear_user function, the copy function had some odd
      handling of the repeat counts, complicating the exception handling for
      no really good reason.  So as with clear_user, just write it to keep all
      the byte counts in the %rcx register, exactly like the 'rep movs'
      functionality that this replaces.
      
      Unlike a real 'rep movs', we do allow for this to trash a few temporary
      registers to not have to unnecessarily save/restore registers on the
      stack.
      
      And like the clearing case, rename this to what it now clearly is:
      'rep_movs_alternative', and make it one coherent function, so that it
      shows up as such in profiles (instead of the odd split between
      "copy_user_generic_unrolled" and "copy_user_short_string", the latter of
      which was not about strings at all, and which was shared with the
      uncached case).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: improve on the non-rep 'clear_user' function · 8c9b6a88
      Linus Torvalds authored
      The old version was oddly written to have the repeat count in multiple
      registers.  So instead of taking advantage of %rax being zero, it had
      some sub-counts in it.  All just for a "single word clearing" loop,
      which isn't even efficient to begin with.
      
      So get rid of those games, and just keep all the state in the same
      registers we got it in (and that we should return things in).  That not
      only makes this act much more like 'rep stos' (which this function is
      replacing), but makes it much easier to actually do the obvious loop
      unrolling.
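      
      For reference, the 'rep stos' register convention it now mirrors, as a
      hedged userspace sketch (the function name is made up, and the real
      kernel code additionally handles faults):
      
        #include <stddef.h>
        
        /* 'rep stosb' convention: destination in %rdi, count in %rcx, fill
         * byte in %al; whatever is left in %rcx afterwards was not cleared. */
        static size_t rep_stosb_zero(void *dst, size_t len)
        {
                asm volatile("rep stosb"
                             : "+D" (dst), "+c" (len)
                             : "a" (0)
                             : "memory");
                return len;                     /* bytes not cleared (0 here) */
        }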
      
      Also rename the function from the now nonsensical 'clear_user_original'
      to what it now clearly is: 'rep_stos_alternative'.
      
      End result: if we don't have a fast 'rep stosb', at least we can have a
      fast fallback for it.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: inline the 'rep movs' in user copies for the FSRM case · 577e6a7f
      Linus Torvalds authored
      This does the same thing for the user copies as commit 0db7058e
      ("x86/clear_user: Make it faster") did for clear_user().  In other
      words, it inlines the "rep movs" case when X86_FEATURE_FSRM is set,
      avoiding the function call entirely.
      
      In order to do that, it makes the calling convention for the out-of-line
      case ("copy_user_generic_unrolled") match the 'rep movs' calling
      convention, although it does also end up clobbering a number of
      additional registers.
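      
      In userspace terms, the 'rep movs' convention being matched looks like
      this (a sketch only; the kernel version also handles faults and, as
      noted below, additionally reports the result in %rax):
      
        #include <stddef.h>
        
        /* 'rep movsb' convention: source in %rsi, destination in %rdi, byte
         * count in %rcx; whatever remains in %rcx afterwards was not copied. */
        static size_t rep_movsb_copy(void *dst, const void *src, size_t len)
        {
                asm volatile("rep movsb"
                             : "+D" (dst), "+S" (src), "+c" (len)
                             :
                             : "memory");
                return len;                     /* bytes not copied (0 here) */
        }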
      
      Also, to simplify code sharing in the low-level assembly with the
      __copy_user_nocache() function (that uses the normal C calling
      convention), we end up with a kind of mixed return value for the
      low-level asm code: it will return the result in both %rcx (to work as
      an alternative for the 'rep movs' case), _and_ in %rax (for the nocache
      case).
      
      We could avoid this by wrapping __copy_user_nocache() callers in an
      inline asm, but since the cost is just an extra register copy, it's
      probably not worth it.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: move stac/clac from user copy routines into callers · 3639a535
      Linus Torvalds authored
      This is preparatory work for inlining the 'rep movs' case, but also a
      cleanup.  The __copy_user_nocache() function was mis-used by the rdma
      code to do uncached kernel copies that don't actually want user copies
      at all, and as a result don't want the stac/clac either.
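      
      The caller-side pattern after the move looks roughly like this (a
      sketch; stac() and clac() are the existing SMAP helpers, while the
      callee name below is just a placeholder for the out-of-line copy
      routine):
      
        /* Kernel-style fragment (illustrative, not the literal source): the
         * C wrapper now brackets the low-level copy with stac()/clac()
         * itself, instead of the asm routine doing it. */
        stac();                                 /* open user access (SMAP) */
        ret = out_of_line_user_copy(to, from, len);  /* placeholder name */
        clac();                                 /* close user access */
      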
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for user memory clearing · d2c95f9d
      Linus Torvalds authored
      The modern target to use is FSRS (Fast Short REP STOS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      clearing).
      
      Note! This changes the conditional for the inlining from FSRM ("fast
      short rep movs") to FSRS ("fast short rep stos").
      
      We'll have a separate fixup for AMD microarchitectures that have a good
      'rep stosb' yet do not set the new Intel-specific FSRS bit (because FSRM
      was there first).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for user memory copies · adfcf423
      Linus Torvalds authored
      The modern target to use is FSRM (Fast Short REP MOVS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      clearing).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for small memory clearing · 20f3337d
      Linus Torvalds authored
      The modern target to use is FSRS (Fast Short REP STOS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      clearing).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: don't use REP_GOOD or ERMS for small memory copies · 68674f94
      Linus Torvalds authored
      The modern target to use is FSRM (Fast Short REP MOVS), and the other
      cases should only be used for bigger areas (ie mainly things like page
      copying and clearing).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 16 Apr, 2023 12 commits
  5. 15 Apr, 2023 6 commits
  6. 14 Apr, 2023 11 commits