1. 15 Feb, 2017 18 commits
  2. 11 Feb, 2017 17 commits
    • crypto: brcm - Add Broadcom SPU driver · 9d12ba86
      Rob Rice authored
      Add Broadcom Secure Processing Unit (SPU) crypto driver for SPU
      hardware crypto offload. The driver supports ablkcipher, ahash,
      and aead symmetric crypto operations.
      Signed-off-by: Steve Lin <steven.lin1@broadcom.com>
      Signed-off-by: Rob Rice <rob.rice@broadcom.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: brcm - DT documentation for Broadcom SPU hardware · 206dc4fc
      Rob Rice authored
      Device tree documentation for Broadcom Secure Processing Unit
      (SPU) crypto hardware.
      Signed-off-by: Steve Lin <steven.lin1@broadcom.com>
      Signed-off-by: Rob Rice <rob.rice@broadcom.com>
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: cavium - Enable CPT options crypto for build · 62ad8b5c
      George Cherian authored
      Add the CPT options in crypto Kconfig and update the
      crypto Makefile
      
      Update the MAINTAINERS file too.
      Signed-off-by: George Cherian <george.cherian@cavium.com>
      Reviewed-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: cavium - Add the Virtual Function driver for CPT · c694b233
      George Cherian authored
      Enable the CPT VF driver. CPT is the cryptographic acceleration unit
      in the Octeon-tx series of processors.
      Signed-off-by: George Cherian <george.cherian@cavium.com>
      Reviewed-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: cavium - Add Support for Octeon-tx CPT Engine · 9e2c7d99
      George Cherian authored
      Enable the Physical Function driver for the Cavium Crypto Engine (CPT)
      found in the Octeon-tx series of SoCs. CPT is the Cryptographic Acceleration
      Unit. CPT includes microcoded GigaCypher symmetric engines (SEs) and
      asymmetric engines (AEs).
      Signed-off-by: George Cherian <george.cherian@cavium.com>
      Reviewed-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • hwrng: cavium - Use per device name to allow for multiple devices. · 87f3d088
      David Daney authored
      Systems containing the Cavium HW RNG may have one device per NUMA
      node.  A typical configuration is a 2-node NUMA system, which results
      in 2 RNG devices.  The hwrng subsystem refuses (and rightly so) to
      register more than one device with the same name, so we get failure
      messages on these systems.
      
      Make the hwrng name unique by including the underlying device name.
      Also remove spaces from the name to make it possible to switch devices
      via the sysfs knobs.
      Signed-off-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
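
The naming scheme described in the hwrng commit above can be illustrated with a minimal user-space sketch. The helper name `make_rng_name()` and the `cavium-rng-` prefix are assumptions made for illustration; a kernel driver would more likely build the string with `devm_kasprintf()` and `dev_name()`.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: build a per-device hwrng name that contains no
 * spaces. The "cavium-rng-" prefix is an assumption for this sketch.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int make_rng_name(char *buf, size_t len, const char *dev_name)
{
    int n = snprintf(buf, len, "cavium-rng-%s", dev_name);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Two devices with different underlying names (for example one per NUMA node) then get distinct hwrng names, which is what allows both registrations to succeed.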
    • crypto: atmel - fix 64-bit build warnings · 4c147bcf
      Arnd Bergmann authored
      When we enable COMPILE_TEST building for the Atmel sha and tdes implementations,
      we run into a couple of warnings about incorrect format strings, e.g.
      
      In file included from include/linux/platform_device.h:14:0,
                       from drivers/crypto/atmel-sha.c:24:
      drivers/crypto/atmel-sha.c: In function 'atmel_sha_xmit_cpu':
      drivers/crypto/atmel-sha.c:571:19: error: format '%d' expects argument of type 'int', but argument 6 has type 'size_t {aka long unsigned int}' [-Werror=format=]
      In file included from include/linux/printk.h:6:0,
                       from include/linux/kernel.h:13,
                       from drivers/crypto/atmel-tdes.c:17:
      drivers/crypto/atmel-tdes.c: In function 'atmel_tdes_crypt_dma_stop':
      include/linux/kern_levels.h:4:18: error: format '%u' expects argument of type 'unsigned int', but argument 2 has type 'size_t {aka long unsigned int}' [-Werror=format=]
      
      These are all fixed by using the "%z" modifier for size_t data.
      
      There are also a few uses of min()/max() with incompatible types:
      
      drivers/crypto/atmel-tdes.c: In function 'atmel_tdes_crypt_start':
      drivers/crypto/atmel-tdes.c:528:181: error: comparison of distinct pointer types lacks a cast [-Werror]
      
      Where possible, we should use consistent types here, otherwise we can use
      min_t()/max_t() to get well-defined behavior without a warning.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
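
Both fixes from the atmel commit above can be sketched in user-space C. `MIN_T`, `format_len()` and `clamp_len()` are hypothetical names; `MIN_T` is a simplified stand-in for the kernel's `min_t()` and, unlike the kernel macro, evaluates its arguments more than once.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Simplified stand-in for min_t(): cast both operands to one explicit
 * type before comparing, so mixed-type comparisons are well defined.
 */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* "%zu" matches size_t on both 32-bit and 64-bit targets. */
static int format_len(char *buf, size_t buflen, size_t len)
{
    return snprintf(buf, buflen, "len=%zu", len);
}

/* Clamp a size_t request to an unsigned int limit using one common type. */
static size_t clamp_len(size_t requested, unsigned int limit)
{
    return MIN_T(size_t, requested, limit);
}
```

Using `%d` or `%u` for the `len` argument would compile cleanly on a 32-bit target and warn on a 64-bit one, which matches the warnings quoted in the commit message.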
    • crypto: atmel - refine Kconfig dependencies · ceb4afb3
      Arnd Bergmann authored
      With the new authenc support, we get a harmless Kconfig warning:
      
      warning: (CRYPTO_DEV_ATMEL_AUTHENC) selects CRYPTO_DEV_ATMEL_SHA which has unmet direct dependencies (CRYPTO && CRYPTO_HW && ARCH_AT91)
      
      The problem is that each of the options has slightly different dependencies,
      although they all seem to want the same thing: allow building for real AT91
      targets that actually have the hardware, and possibly for compile testing.
      
      This makes all four options consistent: instead of depending on a particular
      dmaengine implementation, we depend on the ARM platform, with
      CONFIG_COMPILE_TEST as an alternative when that is turned off. This makes
      the 'select' statements work correctly.
      
      Fixes: 89a82ef8 ("crypto: atmel-authenc - add support to authenc(hmac(shaX), Y(aes)) modes")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic · db91af0f
      Ard Biesheuvel authored
      Instead of unconditionally forcing 4 byte alignment for all generic
      chaining modes that rely on crypto_xor() or crypto_inc() (which may
      result in unnecessary copying of data when the underlying hardware
      can perform unaligned accesses efficiently), make those functions
      deal with unaligned input explicitly, but only if the Kconfig symbol
      HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
      the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
      
      For crypto_inc(), this simply involves making the 4-byte stride
      conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
      it typically operates on 16 byte buffers.
      
      For crypto_xor(), an algorithm is implemented that simply runs through
      the input using the largest strides possible if unaligned accesses are
      allowed. If they are not, an optimal sequence of memory accesses is
      emitted that takes the relative alignment of the input buffers into
      account, e.g., if the relative misalignment of dst and src is 4 bytes,
      the entire xor operation will be completed using 4 byte loads and stores
      (modulo unaligned bits at the start and end). Note that all expressions
      involving misalign are simply eliminated by the compiler when
      HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
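
The two ideas in the algapi commit above can be sketched in portable C. This is an illustration of the approach, not the kernel implementation: `counter_inc()` and `xor_bytes()` are hypothetical names, and the sketch only distinguishes strides of 8, 4 and 1 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Big-endian increment, one byte at a time: bump the last byte and carry
 * towards the first. Safe for any alignment; a wider stride is only an
 * optimization on top of this.
 */
static void counter_inc(uint8_t *ctr, size_t size)
{
    while (size > 0) {
        size--;
        if (++ctr[size] != 0)   /* no wrap to zero, so no carry is left */
            break;
    }
}

/*
 * dst[i] ^= src[i] for len bytes, using the widest stride (8, 4 or 1
 * bytes) that the relative alignment of dst and src allows. memcpy()
 * keeps the word accesses valid C; compilers usually lower it to plain
 * loads and stores.
 */
static void xor_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
    uintptr_t rel = ((uintptr_t)dst ^ (uintptr_t)src) & 7;
    size_t stride = (rel == 0) ? 8 : (rel == 4) ? 4 : 1;

    /* Head: single bytes until dst, and therefore src, is stride-aligned. */
    while (len > 0 && ((uintptr_t)dst & (stride - 1)) != 0) {
        *dst++ ^= *src++;
        len--;
    }

    if (stride == 8) {
        for (; len >= 8; dst += 8, src += 8, len -= 8) {
            uint64_t a, b;
            memcpy(&a, dst, 8);
            memcpy(&b, src, 8);
            a ^= b;
            memcpy(dst, &a, 8);
        }
    } else if (stride == 4) {
        for (; len >= 4; dst += 4, src += 4, len -= 4) {
            uint32_t a, b;
            memcpy(&a, dst, 4);
            memcpy(&b, src, 4);
            a ^= b;
            memcpy(dst, &a, 4);
        }
    }

    /* Tail: whatever is left, byte by byte. */
    for (; len > 0; len--)
        *dst++ ^= *src++;
}
```

When `dst` and `src` differ by 4 modulo 8, the body runs entirely on 4-byte accesses, which corresponds to the example given in the commit message.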
    • crypto: improve gcc optimization flags for serpent and wp512 · 7d6e9105
      Arnd Bergmann authored
      An ancient gcc bug (first reported in 2003) has apparently resurfaced
      on MIPS, where kernelci.org reports an overly large stack frame in the
      whirlpool hash algorithm:
      
      crypto/wp512.c:987:1: warning: the frame size of 1112 bytes is larger than 1024 bytes [-Wframe-larger-than=]
      
      With some testing in different configurations, I'm seeing large
      variations in stack frame size, up to 1500 bytes for what should be
      around 300 bytes at most. I also checked the reference implementation,
      which is essentially the same code but also comes with some test and
      benchmarking infrastructure.
      
      It seems that recent compiler versions on at least arm, arm64 and powerpc
      have a partial fix for this problem, by enabling "-fsched-pressure", but
      even with that fix they suffer from the issue to a certain degree. Some
      testing on arm64 shows that the time needed to hash a given amount of
      data is roughly proportional to the stack frame size here, which makes
      sense given that the wp512 implementation is doing lots of loads for
      table lookups, and the problem with the overly large stack is a result
      of doing a lot more loads and stores for spilled registers (as seen from
      inspecting the object code).
      
      Disabling -fschedule-insns consistently fixes the problem for wp512.
      In my collection of cross-compilers, the results are consistently better
      or identical when comparing the stack sizes in this function, though
      some architectures (notably x86) have schedule-insns disabled by
      default.
      
      The four columns are:
      default: -O2
      press:	 -O2 -fsched-pressure
      nopress: -O2 -fschedule-insns -fno-sched-pressure
      nosched: -O2 -fno-schedule-insns (disables sched-pressure)
      
      				default	press	nopress	nosched
      alpha-linux-gcc-4.9.3		1136	848	1136	176
      am33_2.0-linux-gcc-4.9.3	2100	2076	2100	2104
      arm-linux-gnueabi-gcc-4.9.3	848	848	1048	352
      cris-linux-gcc-4.9.3		272	272	272	272
      frv-linux-gcc-4.9.3		1128	1000	1128	280
      hppa64-linux-gcc-4.9.3		1128	336	1128	184
      hppa-linux-gcc-4.9.3		644	308	644	276
      i386-linux-gcc-4.9.3		352	352	352	352
      m32r-linux-gcc-4.9.3		720	656	720	268
      microblaze-linux-gcc-4.9.3	1108	604	1108	256
      mips64-linux-gcc-4.9.3		1328	592	1328	208
      mips-linux-gcc-4.9.3		1096	624	1096	240
      powerpc64-linux-gcc-4.9.3	1088	432	1088	160
      powerpc-linux-gcc-4.9.3		1080	584	1080	224
      s390-linux-gcc-4.9.3		456	456	624	360
      sh3-linux-gcc-4.9.3		292	292	292	292
      sparc64-linux-gcc-4.9.3		992	240	992	208
      sparc-linux-gcc-4.9.3		680	592	680	312
      x86_64-linux-gcc-4.9.3		224	240	272	224
      xtensa-linux-gcc-4.9.3		1152	704	1152	304
      
      aarch64-linux-gcc-7.0.0		224	224	1104	208
      arm-linux-gnueabi-gcc-7.0.1	824	824	1048	352
      mips-linux-gcc-7.0.0		1120	648	1120	272
      x86_64-linux-gcc-7.0.1		240	240	304	240
      
      arm-linux-gnueabi-gcc-4.4.7	840			392
      arm-linux-gnueabi-gcc-4.5.4	784	728	784	320
      arm-linux-gnueabi-gcc-4.6.4	736	728	736	304
      arm-linux-gnueabi-gcc-4.7.4	944	784	944	352
      arm-linux-gnueabi-gcc-4.8.5	464	464	760	352
      arm-linux-gnueabi-gcc-4.9.3	848	848	1048	352
      arm-linux-gnueabi-gcc-5.3.1	824	824	1064	336
      arm-linux-gnueabi-gcc-6.1.1	808	808	1056	344
      arm-linux-gnueabi-gcc-7.0.1	824	824	1048	352
      
      Trying the same test for serpent-generic, the picture is a bit different,
      and while -fno-schedule-insns is generally better here than the default,
      -fsched-pressure wins overall, so I picked that instead.
      
      				default	press	nopress	nosched
      alpha-linux-gcc-4.9.3		1392	864	1392	960
      am33_2.0-linux-gcc-4.9.3	536	524	536	528
      arm-linux-gnueabi-gcc-4.9.3	552	552	776	536
      cris-linux-gcc-4.9.3		528	528	528	528
      frv-linux-gcc-4.9.3		536	400	536	504
      hppa64-linux-gcc-4.9.3		524	208	524	480
      hppa-linux-gcc-4.9.3		768	472	768	508
      i386-linux-gcc-4.9.3		564	564	564	564
      m32r-linux-gcc-4.9.3		712	576	712	532
      microblaze-linux-gcc-4.9.3	724	392	724	512
      mips64-linux-gcc-4.9.3		720	384	720	496
      mips-linux-gcc-4.9.3		728	384	728	496
      powerpc64-linux-gcc-4.9.3	704	304	704	480
      powerpc-linux-gcc-4.9.3		704	296	704	480
      s390-linux-gcc-4.9.3		560	560	592	536
      sh3-linux-gcc-4.9.3		540	540	540	540
      sparc64-linux-gcc-4.9.3		544	352	544	496
      sparc-linux-gcc-4.9.3		544	344	544	496
      x86_64-linux-gcc-4.9.3		528	536	576	528
      xtensa-linux-gcc-4.9.3		752	544	752	544
      
      aarch64-linux-gcc-7.0.0		432	432	656	480
      arm-linux-gnueabi-gcc-7.0.1	616	616	808	536
      mips-linux-gcc-7.0.0		720	464	720	488
      x86_64-linux-gcc-7.0.1		536	528	600	536
      
      arm-linux-gnueabi-gcc-4.4.7	592			440
      arm-linux-gnueabi-gcc-4.5.4	776	448	776	544
      arm-linux-gnueabi-gcc-4.6.4	776	448	776	544
      arm-linux-gnueabi-gcc-4.7.4	768	448	768	544
      arm-linux-gnueabi-gcc-4.8.5	488	488	776	544
      arm-linux-gnueabi-gcc-4.9.3	552	552	776	536
      arm-linux-gnueabi-gcc-5.3.1	552	552	776	536
      arm-linux-gnueabi-gcc-6.1.1	560	560	776	536
      arm-linux-gnueabi-gcc-7.0.1	616	616	808	536
      
      I did not do any runtime tests with serpent, so it is possible that stack
      frame size does not directly correlate with runtime performance here and
      it actually makes things worse, but it's more likely to help here, and
      the reduced stack frame size is probably enough reason to apply the patch,
      especially given that the crypto code is often used in deep call chains.
      
      Link: https://kernelci.org/build/id/58797d7559b5149efdf6c3a9/logs/
      Link: http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11488
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/aes - add NEON/Crypto Extensions CBCMAC/CMAC/XCBC driver · 4860620d
      Ard Biesheuvel authored
      On ARMv8 implementations that do not support the Crypto Extensions,
      such as the Raspberry Pi 3, the CCM driver falls back to the generic
      table based AES implementation to perform the MAC part of the
      algorithm, which is slow and not time invariant. So add a CBCMAC
      implementation to the shared glue code between NEON AES and Crypto
      Extensions AES, so that it can be used instead now that the CCM
      driver has been updated to look for CBCMAC implementations other
      than the one it supplies itself.
      
      Also, given how these algorithms mostly only differ in the way the key
      handling and the final encryption are implemented, expose CMAC and XCBC
      algorithms as well based on the same core update code.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: ccm - switch to separate cbcmac driver · f15f05b0
      Ard Biesheuvel authored
      Update the generic CCM driver to defer CBC-MAC processing to a
      dedicated CBC-MAC ahash transform rather than open coding this
      transform (and much of the associated scatterwalk plumbing) in
      the CCM driver itself.
      
      This cleans up the code considerably, but more importantly, it allows
      the use of alternative CBC-MAC implementations that don't suffer from
      performance degradation due to significant setup time (e.g., the NEON
      based AES code needs to enable/disable the NEON, and load the S-box
      into 16 SIMD registers, which cannot be amortized over the entire input
      when using the cipher interface)
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
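
To make the CBC-MAC split in the ccm commit above concrete, here is a minimal sketch of a CBC-MAC that takes the block cipher as a parameter, so a different implementation can be plugged in. All names are hypothetical, and `toy_encrypt()` is an insecure stand-in for AES that exists only to keep the sketch self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* A block cipher primitive: encrypt one block in place. */
typedef void (*block_enc_fn)(const uint8_t *key, uint8_t block[BLK]);

/*
 * Toy stand-in for AES, NOT secure: rotate the block by one byte and XOR
 * it with the key.
 */
static void toy_encrypt(const uint8_t *key, uint8_t block[BLK])
{
    uint8_t first = block[0];

    for (int i = 0; i < BLK - 1; i++)
        block[i] = block[i + 1] ^ key[i];
    block[BLK - 1] = first ^ key[BLK - 1];
}

/*
 * CBC-MAC: XOR each message block into the running state, then encrypt
 * the state. A trailing partial block is treated as zero-padded.
 */
static void cbcmac(block_enc_fn enc, const uint8_t *key,
                   const uint8_t *msg, size_t len, uint8_t mac[BLK])
{
    memset(mac, 0, BLK);
    while (len > 0) {
        size_t n = len < BLK ? len : BLK;

        for (size_t i = 0; i < n; i++)
            mac[i] ^= msg[i];
        enc(key, mac);
        msg += n;
        len -= n;
    }
}
```

Raw CBC-MAC like this is only safe for fixed-length messages; CCM adds its own length encoding around it, which the sketch does not cover.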
    • crypto: testmgr - add test cases for cbcmac(aes) · 092acf06
      Ard Biesheuvel authored
      In preparation for splitting off the CBC-MAC transform in the CCM
      driver into a separate algorithm, define some test cases for the
      AES incarnation of cbcmac.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: aes - add generic time invariant AES cipher · b5e0b032
      Ard Biesheuvel authored
      Lookup table based AES is sensitive to timing attacks, which is due to
      the fact that such table lookups are data dependent, and the fact that
      8 KB worth of tables covers a significant number of cachelines on any
      architecture, resulting in an exploitable correlation between the key
      and the processing time for known plaintexts.
      
      For network facing algorithms such as CTR, CCM or GCM, this presents a
      security risk, which is why arch specific AES ports are typically time
      invariant, either through the use of special instructions, or by using
      SIMD algorithms that don't rely on table lookups.
      
      For generic code, this is difficult to achieve without losing too much
      performance, but we can improve the situation significantly by switching
      to an implementation that only needs 256 bytes of table data (the actual
      S-box itself), which can be prefetched at the start of each block to
      eliminate data dependent latencies.
      
      This code encrypts at ~25 cycles per byte on ARM Cortex-A57 (while the
      ordinary generic AES driver manages 18 cycles per byte on this
      hardware). Decryption is substantially slower.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
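
The prefetching idea in the time-invariant AES commit above can be sketched as follows. This shows the general technique, not the code from the commit: `prefetch_table()` is a hypothetical name and the 64-byte cache line size is an assumption that varies between CPUs.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64   /* assumption: typical line size, CPU dependent */

/*
 * Touch one byte in every cache line of a 256-byte table so that later
 * data-dependent lookups are more likely to hit the cache at a uniform
 * cost. The XOR of the touched bytes is returned, and the table is
 * accessed through a volatile pointer, so the loads cannot be dropped.
 */
static uint8_t prefetch_table(const volatile uint8_t table[256])
{
    uint8_t acc = 0;

    for (size_t i = 0; i < 256; i += CACHELINE_BYTES)
        acc ^= table[i];
    return acc;
}
```

With a 256-byte table only four lines need to be touched under this assumption, which is cheap enough to repeat for every block; an 8 KB table set would need 128 such loads.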
    • crypto: aes-generic - drop alignment requirement · ec38a937
      Ard Biesheuvel authored
      The generic AES code exposes a 32-bit align mask, which forces all
      users of the code to use temporary buffers or take other measures to
      ensure the alignment requirement is adhered to, even on architectures
      that don't care about alignment for software algorithms such as this
      one.
      
      So drop the align mask, and fix the code to use get_unaligned_le32()
      where appropriate, which will resolve to whatever is optimal for the
      architecture.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
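
As a rough user-space illustration of what an accessor such as `get_unaligned_le32()` provides, the sketch below loads a little-endian 32-bit value from an arbitrary address. `load_le32()` is a hypothetical name; the kernel helper resolves to whatever is optimal for the architecture, while this version always assembles the value byte by byte.

```c
#include <stdint.h>

/*
 * Read a little-endian 32-bit value from any address. Byte-wise access
 * never faults on alignment and gives the same result on big-endian and
 * little-endian hosts.
 */
static uint32_t load_le32(const void *p)
{
    const uint8_t *b = p;

    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}
```

Callers can then pass key or data pointers directly, with no temporary aligned buffer in between.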
    • crypto: sha512-mb - Protect sha512 mb ctx mgr access · c459bd7b
      Tim Chen authored
      The flusher and regular multi-buffer computation via mcryptd may race with
      one another. Add a lock and turn off interrupts when accessing the
      multi-buffer computation state cstate->mgr before a round of computation.
      This should prevent the flusher code from jumping in.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/crc32 - merge CRC32 and PMULL instruction based drivers · 5d3d9c8b
      Ard Biesheuvel authored
      The PMULL based CRC32 implementation already contains code based on the
      separate, optional CRC32 instructions to fall back to when operating on
      small quantities of data. We can expose these routines directly on systems
      that lack the 64x64 PMULL instructions but do implement the CRC32 ones,
      which makes the driver that is based solely on those CRC32 instructions
      redundant. So remove it.
      
      Note that this aligns arm64 with ARM, whose accelerated CRC32 driver
      also combines the CRC32 extension based and the PMULL based versions.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Matthias Brugger <mbrugger@suse.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  3. 03 Feb, 2017 5 commits
    • crypto: arm/aes - don't use IV buffer to return final keystream block · 1a20b966
      Ard Biesheuvel authored
      The ARM bit sliced AES core code uses the IV buffer to pass the final
      keystream block back to the glue code if the input is not a multiple of
      the block size, so that the asm code does not have to deal with anything
      except 16 byte blocks. This is done under the assumption that the outgoing
      IV is meaningless anyway in this case, given that chaining is no longer
      possible under these circumstances.
      
      However, as it turns out, the CCM driver does expect the IV to retain
      a value that is equal to the original IV except for the counter value,
      and even interprets byte zero as a length indicator, which may result
      in memory corruption if the IV is overwritten with something else.
      
      So use a separate buffer to return the final keystream block.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
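
The fix described in the commit above can be illustrated with a small CTR-mode sketch in which the keystream for a partial final block goes into a separate buffer, so the IV only ever holds a valid counter block. All names are hypothetical, and `toy_keystream()` is an insecure stand-in for encrypting the counter with AES.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy keystream generator, NOT secure: stands in for AES(key, counter). */
static void toy_keystream(const uint8_t ctr[BLK], uint8_t ks[BLK])
{
    for (int i = 0; i < BLK; i++)
        ks[i] = (uint8_t)(ctr[i] * 31 + i + 0x5a);
}

/* Big-endian increment of the counter block. */
static void ctr_inc_be(uint8_t ctr[BLK])
{
    for (int i = BLK - 1; i >= 0; i--)
        if (++ctr[i] != 0)
            break;
}

/*
 * CTR mode over len bytes. The keystream always goes into the local ks
 * buffer, including for a final partial block, so iv is only ever
 * modified by counter increments.
 */
static void ctr_crypt(uint8_t iv[BLK], uint8_t *data, size_t len)
{
    uint8_t ks[BLK];

    while (len > 0) {
        size_t n = len < BLK ? len : BLK;

        toy_keystream(iv, ks);
        for (size_t i = 0; i < n; i++)
            data[i] ^= ks[i];
        ctr_inc_be(iv);
        data += n;
        len -= n;
    }
}
```

After a 20-byte request the IV differs from its original value only in the counter bytes, so a caller such as CCM that reads byte zero of the IV afterwards still sees the value it wrote.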
    • crypto: arm64/aes - don't use IV buffer to return final keystream block · 88a3f582
      Ard Biesheuvel authored
      The arm64 bit sliced AES core code uses the IV buffer to pass the final
      keystream block back to the glue code if the input is not a multiple of
      the block size, so that the asm code does not have to deal with anything
      except 16 byte blocks. This is done under the assumption that the outgoing
      IV is meaningless anyway in this case, given that chaining is no longer
      possible under these circumstances.
      
      However, as it turns out, the CCM driver does expect the IV to retain
      a value that is equal to the original IV except for the counter value,
      and even interprets byte zero as a length indicator, which may result
      in memory corruption if the IV is overwritten with something else.
      
      So use a separate buffer to return the final keystream block.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/aes - replace scalar fallback with plain NEON fallback · 12fcd923
      Ard Biesheuvel authored
      The new bitsliced NEON implementation of AES uses a fallback in two
      places: CBC encryption (which is strictly sequential, whereas this
      driver can only operate efficiently on 8 blocks at a time), and the
      XTS tweak generation, which involves encrypting a single AES block
      with a different key schedule.
      
      The plain (i.e., non-bitsliced) NEON code is more suitable as a fallback,
      given that it is faster than scalar on low end cores (which is what
      the NEON implementations target, since high end cores have dedicated
      instructions for AES), and shows similar behavior in terms of D-cache
      footprint and sensitivity to cache timing attacks. So switch the fallback
      handling to the plain NEON driver.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/aes-neon-blk - tweak performance for low end cores · 4edd7d01
      Ard Biesheuvel authored
      The non-bitsliced AES implementation using the NEON is highly sensitive
      to micro-architectural details, and, as it turns out, the Cortex-A53 on
      the Raspberry Pi 3 is a core that can benefit from this code, given that
      its scalar AES performance is abysmal (32.9 cycles per byte).
      
      The new bitsliced AES code manages 19.8 cycles per byte on this core,
      but can only operate on 8 blocks at a time, which is not supported by
      all chaining modes. With a bit of tweaking, we can get the plain NEON
      code to run at 22.0 cycles per byte, making it useful for sequential
      modes like CBC encryption. (Like bitsliced NEON, the plain NEON
      implementation does not use any lookup tables, which makes it easy on
      the D-cache, and invulnerable to cache timing attacks)
      
      So tweak the plain NEON AES code to use tbl instructions rather than
      shl/sri pairs, and to avoid the need to reload permutation vectors or
      other constants from memory in every round. Also, improve the decryption
      performance by switching to 16x8 pmul instructions for performing
      the multiplications in GF(2^8).
      
      To allow the ECB and CBC encrypt routines to be reused by the bitsliced
      NEON code in a subsequent patch, export them from the module.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
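
For readers unfamiliar with the GF(2^8) multiplications mentioned in the commit above, here is a scalar reference sketch using the AES reduction polynomial 0x11b. The names `gf_xtime()` and `gf_mul()` are hypothetical. The NEON code performs the carry-less multiply in hardware on 16 lanes at once; this loop only shows the arithmetic and is not constant time.

```c
#include <stdint.h>

/* Multiply by x (that is, by 2) in GF(2^8) modulo the AES polynomial. */
static uint8_t gf_xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Shift-and-add multiplication in GF(2^8); addition is XOR. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;

    while (b) {
        if (b & 1)
            r ^= a;
        a = gf_xtime(a);
        b >>= 1;
    }
    return r;
}
```

AES decryption multiplies state bytes by 0x09, 0x0b, 0x0d and 0x0e in InvMixColumns, so it needs more of these products than encryption, which only uses 0x02 and 0x03.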
    • crypto: arm64/aes - performance tweak · c458c4ad
      Ard Biesheuvel authored
      Shuffle some instructions around in the __hround macro to shave off
      0.1 cycles per byte on Cortex-A57.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>