    arm64/lib: improve CRC32 performance for deep pipelines · efdb25ef
    Ard Biesheuvel authored
    Improve the performance of the crc32() asm routines by getting rid of
    most of the branches and small-sized loads on the common path.
    
    Instead, use a branchless code path involving overlapping 16 byte
    loads to process the first (length % 32) bytes, and process the
    remainder using a loop that processes 32 bytes at a time.
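    
    The split described above can be sketched in portable C. This is only
    an illustration, not the code in crc32.S: crc32_byte(), crc32_bytes()
    and crc32_split() are hypothetical names, a bitwise software CRC
    stands in for the hardware CRC32 instructions, and the branchless
    overlapping 16 byte loads are not reproduced. It shows only the
    ordering: (length % 32) bytes first, then 32 bytes per iteration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 update for one byte (reflected polynomial 0xEDB88320).
 * Assumption: a portable stand-in for the hardware CRC32 instructions.
 * No pre- or post-inversion is applied; the caller supplies the seed.
 */
static uint32_t crc32_byte(uint32_t crc, unsigned char b)
{
    crc ^= b;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

/* Straightforward byte-at-a-time reference over n bytes. */
static uint32_t crc32_bytes(uint32_t crc, const unsigned char *p, size_t n)
{
    while (n--)
        crc = crc32_byte(crc, *p++);
    return crc;
}

/*
 * Hypothetical helper: consume the first (len % 32) bytes, then process
 * the remainder in fixed 32-byte blocks, so the main loop needs no tail
 * handling.
 */
static uint32_t crc32_split(uint32_t crc, const unsigned char *p, size_t len)
{
    size_t head = len % 32;

    crc = crc32_bytes(crc, p, head);
    p += head;
    len -= head;

    /* len is now a multiple of 32 (possibly zero). */
    for (; len != 0; len -= 32, p += 32)
        crc = crc32_bytes(crc, p, 32);

    return crc;
}
```

    Because the head is handled up front, the loop body always sees a
    whole 32-byte block, and crc32_split() returns the same value as
    crc32_bytes() for any length.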
    
    Tested using the following test program:
    
      #include <stdlib.h>
    
      extern unsigned int crc32_le(unsigned int, char const *, unsigned long);
    
      int main(void)
      {
        static const char buf[4096];
    
        srand(20181126);
    
        for (int i = 0; i < 100 * 1000 * 1000; i++)
          crc32_le(0, buf, rand() % 1024);
    
        return 0;
      }
    
    On Cortex-A53 and Cortex-A57, the performance regresses, but only very
    slightly. On Cortex-A72, however, the performance improves from
    
      $ time ./crc32
    
      real  0m10.149s
      user  0m10.149s
      sys   0m0.000s
    
    to
    
      $ time ./crc32
    
      real  0m7.915s
      user  0m7.915s
      sys   0m0.000s
    
    Cc: Rui Sun <sunrui26@huawei.com>
    Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Signed-off-by: Will Deacon <will.deacon@arm.com>
crc32.S 2.04 KB