1. 04 Nov, 2022 17 commits
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - add CE implementation for GCM mode · ae1b83c7
      Tianjia Zhang authored
      This patch is a CE-optimized assembly implementation for GCM mode.
      
      Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 224 and 224
      modes of tcrypt, and compared the performance before and after this patch (the
      driver used before this patch is gcm_base(ctr-sm4-ce,ghash-generic)).
      The abscissas are blocks of different lengths. The data is tabulated and the
      unit is Mb/s:
      
      Before (gcm_base(ctr-sm4-ce,ghash-generic)):
      
      gcm(sm4)     |     16      64      256      512     1024     1420     4096     8192
      -------------+---------------------------------------------------------------------
        GCM enc    |  25.24   64.65   104.66   116.69   123.81   125.12   129.67   130.62
        GCM dec    |  25.40   64.80   104.74   116.70   123.81   125.21   129.68   130.59
        GCM mb enc |  24.95   64.06   104.20   116.38   123.55   124.97   129.63   130.61
        GCM mb dec |  24.92   64.00   104.13   116.34   123.55   124.98   129.56   130.48
      
      After:
      
      gcm-sm4-ce   |     16      64      256      512     1024     1420     4096     8192
      -------------+---------------------------------------------------------------------
        GCM enc    | 108.62  397.18   971.60  1283.92  1522.77  1513.39  1777.00  1806.96
        GCM dec    | 116.36  398.14  1004.27  1319.11  1624.21  1635.43  1932.54  1974.20
        GCM mb enc | 107.13  391.79   962.05  1274.94  1514.76  1508.57  1769.07  1801.58
        GCM mb dec | 113.40  389.36   988.51  1307.68  1619.10  1631.55  1931.70  1970.86
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      ae1b83c7
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - add CE implementation for CCM mode · 67fa3a7f
      Tianjia Zhang authored
      This patch is a CE-optimized assembly implementation for CCM mode.
      
      Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 223 and 225
      modes of tcrypt, and compared the performance before and after this patch (the
      driver used before this patch is ccm_base(ctr-sm4-ce,cbcmac-sm4-ce)).
      The abscissas are blocks of different lengths. The data is tabulated and the
      unit is Mb/s:
      
      Before (rfc4309(ccm_base(ctr-sm4-ce,cbcmac-sm4-ce))):
      
      ccm(sm4)     |     16      64     256     512    1024    1420    4096    8192
      -------------+---------------------------------------------------------------
        CCM enc    |  35.07  125.40  336.47  468.17  581.97  619.18  712.56  736.01
        CCM dec    |  34.87  124.40  335.08  466.75  581.04  618.81  712.25  735.89
        CCM mb enc |  34.71  123.96  333.92  465.39  579.91  617.49  711.45  734.92
        CCM mb dec |  34.42  122.80  331.02  462.81  578.28  616.42  709.88  734.19
      
      After (rfc4309(ccm-sm4-ce)):
      
      ccm-sm4-ce   |     16      64     256     512    1024    1420    4096    8192
      -------------+---------------------------------------------------------------
        CCM enc    |  77.12  249.82  569.94  725.17  839.27  867.71  952.87  969.89
        CCM dec    |  75.90  247.26  566.29  722.12  836.90  865.95  951.74  968.57
        CCM mb enc |  75.98  245.25  562.91  718.99  834.76  864.70  950.17  967.90
        CCM mb dec |  75.06  243.78  560.58  717.13  833.68  862.70  949.35  967.11
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      67fa3a7f
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac · 6b5360a5
      Tianjia Zhang authored
      This patch is a CE-optimized assembly implementation for cmac/xcbc/cbcmac.
      
      Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 300 mode of
      tcrypt, and compared the performance before and after this patch (the driver
      used before this patch is XXXmac(sm4-ce)). The abscissas are blocks of
      different lengths. The data is tabulated and the unit is Mb/s:
      
      Before:
      
      update-size    |      16      64     256    1024    2048    4096    8192
      ---------------+--------------------------------------------------------
      cmac(sm4-ce)   |  293.33  403.69  503.76  527.78  531.10  535.46  535.81
      xcbc(sm4-ce)   |  292.83  402.50  504.02  529.08  529.87  536.55  538.24
      cbcmac(sm4-ce) |  318.42  415.79  497.12  515.05  523.15  521.19  523.01
      
      After:
      
      update-size    |      16      64     256    1024    2048    4096    8192
      ---------------+--------------------------------------------------------
      cmac-sm4-ce    |  371.99  675.28  903.56  971.65  980.57  990.40  991.04
      xcbc-sm4-ce    |  372.11  674.55  903.47  971.61  980.96  990.42  991.10
      cbcmac-sm4-ce  |  371.63  675.33  903.23  972.07  981.42  990.93  991.45
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      6b5360a5
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - add CE implementation for XTS mode · 01f63311
      Tianjia Zhang authored
      This patch is a CE-optimized assembly implementation for XTS mode.
      
      Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 218 mode of
      tcrypt, and compared the performance before and after this patch (the driver
      used before this patch is xts(ecb-sm4-ce)). The abscissas are blocks of
      different lengths. The data is tabulated and the unit is Mb/s:
      
      Before:
      
      xts(ecb-sm4-ce) |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
              XTS enc |  117.17   430.56   732.92  1134.98  2007.03  2136.23  2347.20
              XTS dec |  116.89   429.02   733.40  1132.96  2006.13  2130.50  2347.92
      
      After:
      
      xts-sm4-ce      |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
              XTS enc |  224.68   798.91  1248.08  1714.60  2413.73  2467.84  2612.62
              XTS dec |  229.85   791.34  1237.79  1720.00  2413.30  2473.84  2611.95
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      01f63311
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - add CE implementation for CTS-CBC mode · b1863fd0
      Tianjia Zhang authored
      This patch is a CE-optimized assembly implementation for CTS-CBC mode.
      
      Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 218 mode of
      tcrypt, and compared the performance before and after this patch (the driver
      used before this patch is cts(cbc-sm4-ce)). The abscissas are blocks of
      different lengths. The data is tabulated and the unit is Mb/s:
      
      Before:
      
      cts(cbc-sm4-ce) |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
          CTS-CBC enc |  286.09   297.17   457.97   627.75   868.58   900.80   957.69
          CTS-CBC dec |  286.67   285.63   538.35   947.08  2241.03  2577.32  3391.14
      
      After:
      
      cts-cbc-sm4-ce  |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
          CTS-CBC enc |  288.19   428.80   593.57   741.04   911.73   931.80   950.00
          CTS-CBC dec |  292.22   468.99   838.23  1380.76  2741.17  3036.42  3409.62
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      b1863fd0
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - export reusable CE acceleration functions · 45089dbe
      Tianjia Zhang authored
      In the accelerated implementation of the SM4 algorithm using the Crypto
      Extension instructions, there are some functions that can be reused in
      the upcoming accelerated implementation of the GCM/CCM mode, and the
      CBC/CFB encryption is reused in the optimized implementation of SVESM4.
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      45089dbe
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - simplify sm4_ce_expand_key() of CE implementation · cb9ba02b
      Tianjia Zhang authored
      Use a 128-bit swap mask and tbl instruction to simplify the implementation
      for generating SM4 rkey_dec.
      
      Also fixed the issue of not being wrapped by kernel_neon_begin/end() when
      using the sm4_ce_expand_key() function.
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      cb9ba02b
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - refactor and simplify CE implementation · ce41fefd
      Tianjia Zhang authored
      This patch does not add new features, but only refactors and simplifies the
      implementation of the Crypto Extension acceleration of the SM4 algorithm:
      
      Extract the macro optimized by SM4 Crypto Extension for reuse in the
      subsequent optimization of CCM/GCM modes.
      
      Encryption in CBC and CFB modes processes four blocks at a time instead of
      one, allowing the ld1 instruction to load 64 bytes of data at a time, which
      will reduces unnecessary memory accesses.
      
      CBC/CFB/CTR makes full use of free registers to reduce redundant memory
      accesses, and rearranges some instructions to improve out-of-order execution
      capabilities.
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      ce41fefd
    • Tianjia Zhang's avatar
      crypto: tcrypt - add SM4 cts-cbc/xts/xcbc test · 3c383637
      Tianjia Zhang authored
      Added CTS-CBC/XTS/XCBC tests for SM4 algorithms, as well as
      corresponding speed tests, this is to test performance-optimized
      implementations of these modes.
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      3c383637
    • Tianjia Zhang's avatar
      crypto: testmgr - add SM4 cts-cbc/xts/xcbc test vectors · c24ee936
      Tianjia Zhang authored
      This patch newly adds the test vectors of CTS-CBC/XTS/XCBC modes of
      the SM4 algorithm, and also added some test vectors for SM4 GCM/CCM.
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      c24ee936
    • Tianjia Zhang's avatar
      crypto: arm64/sm4 - refactor and simplify NEON implementation · 62508017
      Tianjia Zhang authored
      This patch does not add new features. The main work is to refactor and
      simplify the implementation of SM4 NEON, which is reflected in the
      following aspects:
      
      The accelerated implementation supports the arbitrary number of blocks,
      not just multiples of 8, which simplifies the implementation and brings
      some optimization acceleration for data that is not aligned by 8 blocks.
      
      When loading the input data, use the ld4 instruction to replace the
      original ld1 instruction as much as possible, which will save the cost
      of matrix transposition of the input data.
      
      Use 8-block parallelism whenever possible to speed up matrix transpose
      and rotation operations, instead of up to 4-block parallelism.
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      62508017
    • Tianjia Zhang's avatar
      crypto: arm64/sm3 - add NEON assembly implementation · a41b2129
      Tianjia Zhang authored
      This patch adds the NEON acceleration implementation of the SM3 hash
      algorithm. The main algorithm is based on SM3 NEON accelerated work of
      the libgcrypt project.
      
      Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 326 mode
      of tcrypt, and compares the performance data of sm3-generic and sm3-ce.
      The abscissas are blocks of different lengths. The data is tabulated and
      the unit is Mb/s:
      
      update-size    |      16      64     256    1024    2048    4096    8192
      ---------------+--------------------------------------------------------
      sm3-generic    |  185.24  221.28  301.26  307.43  300.83  308.82  308.91
      sm3-neon       |  171.81  220.20  322.94  339.28  334.09  343.61  343.87
      sm3-ce         |  227.48  333.48  502.62  527.87  520.45  534.91  535.40
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      a41b2129
    • Tianjia Zhang's avatar
      crypto: arm64/sm3 - raise the priority of the CE implementation · e1fa51aa
      Tianjia Zhang authored
      Raise the priority of the sm3-ce algorithm from 200 to 400, this is
      to make room for the implementation of sm3-neon.
      Signed-off-by: default avatarTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      e1fa51aa
    • Anirudh Venkataramanan's avatar
      crypto: tcrypt - Drop leading newlines from prints · 3513828c
      Anirudh Venkataramanan authored
      The top level print banners have a leading newline. It's not entirely
      clear why this exists, but it makes it harder to parse tcrypt test output
      using a script. Drop said newlines.
      
      tcrypt output before this patch:
      
      [...]
            testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] test 0 (160 bit key, 16 byte blocks): 1 operation in 2320 cycles (16 bytes)
      
      tcrypt output with this patch:
      
      [...] testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] test 0 (160 bit key, 16 byte blocks): 1 operation in 2320 cycles (16 bytes)
      Signed-off-by: default avatarAnirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      3513828c
    • Anirudh Venkataramanan's avatar
      crypto: tcrypt - Drop module name from print string · a2ef5630
      Anirudh Venkataramanan authored
      The pr_fmt() define includes KBUILD_MODNAME, and so there's no need
      for pr_err() to also print it. Drop module name from the print string.
      Signed-off-by: default avatarAnirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      a2ef5630
    • Anirudh Venkataramanan's avatar
      crypto: tcrypt - Use pr_info/pr_err · 837a99f5
      Anirudh Venkataramanan authored
      Currently, there's mixed use of printk() and pr_info()/pr_err(). The latter
      prints the module name (because pr_fmt() is defined so) but the former does
      not. As a result there's inconsistency in the printed output. For example:
      
      modprobe mode=211:
      
      [...] test 0 (160 bit key, 16 byte blocks): 1 operation in 2320 cycles (16 bytes)
      [...] test 1 (160 bit key, 64 byte blocks): 1 operation in 2336 cycles (64 bytes)
      
      modprobe mode=215:
      
      [...] tcrypt: test 0 (160 bit key, 16 byte blocks): 1 operation in 2173 cycles (16 bytes)
      [...] tcrypt: test 1 (160 bit key, 64 byte blocks): 1 operation in 2241 cycles (64 bytes)
      
      Replace all instances of printk() with pr_info()/pr_err() so that the
      module name is printed consistently.
      Signed-off-by: default avatarAnirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      837a99f5
    • Anirudh Venkataramanan's avatar
      crypto: tcrypt - Use pr_cont to print test results · fdaeb224
      Anirudh Venkataramanan authored
      For some test cases, a line break gets inserted between the test banner
      and the results. For example, with mode=211 this is the output:
      
      [...]
            testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] test 0 (160 bit key, 16 byte blocks):
      [...] 1 operation in 2373 cycles (16 bytes)
      
      --snip--
      
      [...]
            testing speed of gcm(aes) (generic-gcm-aesni) encryption
      [...] test 0 (128 bit key, 16 byte blocks):
      [...] 1 operation in 2338 cycles (16 bytes)
      
      Similar behavior is seen in the following cases as well:
      
        modprobe tcrypt mode=212
        modprobe tcrypt mode=213
        modprobe tcrypt mode=221
        modprobe tcrypt mode=300 sec=1
        modprobe tcrypt mode=400 sec=1
      
      This doesn't happen with mode=215:
      
      [...] tcrypt:
                    testing speed of multibuffer rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] tcrypt: test 0 (160 bit key, 16 byte blocks): 1 operation in 2215 cycles (16 bytes)
      
      --snip--
      
      [...] tcrypt:
                    testing speed of multibuffer gcm(aes) (generic-gcm-aesni) encryption
      [...] tcrypt: test 0 (128 bit key, 16 byte blocks): 1 operation in 2191 cycles (16 bytes)
      
      This print inconsistency is because printk() is used instead of pr_cont()
      in a few places. Change these to be pr_cont().
      
      checkpatch warns that pr_cont() shouldn't be used. This can be ignored in
      this context as tcrypt already uses pr_cont().
      Signed-off-by: default avatarAnirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      fdaeb224
  2. 28 Oct, 2022 23 commits