    crypto: x86/chacha20 - Add a 4-block AVX2 variant · 8a5a79d5
    Martin Willi authored
    This variant builds upon the idea of the 2-block AVX2 variant that
    shuffles words after each round. The shuffling has a rather high latency,
    so the arithmetic units are not optimally used.
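
To illustrate the "shuffle words after each round" idea, here is a minimal scalar C sketch, not the assembly from this commit. It treats the 16-word ChaCha state as four rows of four words, applies the quarter-round to all four columns, then rotates rows 1 to 3 so that the diagonals line up as columns for the second half of the double round. The helper names (rotate_row, column_round, double_round_shuffled) are hypothetical and chosen for this example only. In the vectorized code the row rotations correspond to the word shuffles, which is where the latency mentioned above comes from.

```c
#include <stdint.h>
#include <string.h>

static uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

/* Standard ChaCha quarter-round on four state words. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/* Reference double round: four column rounds, then four diagonal rounds. */
static void double_round_scalar(uint32_t x[16])
{
    quarter_round(&x[0], &x[4], &x[8],  &x[12]);
    quarter_round(&x[1], &x[5], &x[9],  &x[13]);
    quarter_round(&x[2], &x[6], &x[10], &x[14]);
    quarter_round(&x[3], &x[7], &x[11], &x[15]);
    quarter_round(&x[0], &x[5], &x[10], &x[15]);
    quarter_round(&x[1], &x[6], &x[11], &x[12]);
    quarter_round(&x[2], &x[7], &x[8],  &x[13]);
    quarter_round(&x[3], &x[4], &x[9],  &x[14]);
}

/* Rotate one 4-word row left by n word positions (hypothetical helper). */
static void rotate_row(uint32_t row[4], int n)
{
    uint32_t t[4];

    for (int i = 0; i < 4; i++)
        t[i] = row[(i + n) & 3];
    memcpy(row, t, sizeof(t));
}

/* Quarter-round on each of the four columns; a SIMD version does all at once. */
static void column_round(uint32_t x[16])
{
    for (int i = 0; i < 4; i++)
        quarter_round(&x[i], &x[4 + i], &x[8 + i], &x[12 + i]);
}

/* Same double round, expressed as column round + shuffle + column round. */
static void double_round_shuffled(uint32_t x[16])
{
    column_round(x);
    rotate_row(&x[4], 1);   /* bring the diagonals into the columns */
    rotate_row(&x[8], 2);
    rotate_row(&x[12], 3);
    column_round(x);
    rotate_row(&x[4], 3);   /* undo the shuffle */
    rotate_row(&x[8], 2);
    rotate_row(&x[12], 1);
}
```

Both double-round functions produce the same state. The shuffled form is the one that maps onto vector registers, since each column_round becomes a handful of whole-register add, XOR and rotate operations.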
    
    Given that we have plenty of registers in AVX, this version parallelizes
    the 2-block variant to do four blocks. While the first two blocks are
    shuffling, the CPU can do the XORing on the second two blocks and
    vice-versa, which makes this version much faster than the SSSE3 variant
    for four blocks. The latter is now mostly for systems that do not have
    AVX2, but there it is the work-horse, so we keep it in place.
    
    The partial XORing function trailer is very similar to the AVX2 2-block
    variant. While it could be shared, that code segment is rather short;
    profiling is also easier with the trailer integrated, so we keep it per
    function.
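
As a rough picture of what a partial XORing trailer has to do, the following C sketch XORs full chunks directly and then handles the remaining bytes through a small bounce buffer, so that nothing is read or written past the end of the caller's buffers. This is an assumption-laden illustration, not a transcription of the assembly: the function name xor_with_trailer, the 32-byte CHUNK size (one AVX2 register) and the flat keystream buffer are all hypothetical. It also assumes the keystream buffer is padded to a whole number of chunks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 32  /* assumed chunk size: one 256-bit register */

/*
 * XOR len bytes of src with the keystream ks into dst.
 * ks must hold at least len rounded up to a multiple of CHUNK bytes.
 */
static void xor_with_trailer(uint8_t *dst, const uint8_t *src,
                             const uint8_t *ks, size_t len)
{
    size_t full = len - (len % CHUNK);
    size_t rem = len - full;

    /* Main part: whole chunks are XORed straight from src to dst. */
    for (size_t off = 0; off < full; off += CHUNK)
        for (size_t i = 0; i < CHUNK; i++)
            dst[off + i] = src[off + i] ^ ks[off + i];

    /* Trailer: copy the tail in, XOR a whole chunk, copy only rem bytes out. */
    if (rem) {
        uint8_t buf[CHUNK] = {0};

        memcpy(buf, src + full, rem);
        for (size_t i = 0; i < CHUNK; i++)
            buf[i] ^= ks[full + i];
        memcpy(dst + full, buf, rem);
    }
}
```

The trailer is only a few lines, which is consistent with the reasoning above that sharing it between functions would save little.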
    
    Signed-off-by: Martin Willi <martin@strongswan.org>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
chacha20-avx2-x86_64.S 24.6 KB