• Arnd Bergmann's avatar
    crypto: improve gcc optimization flags for serpent and wp512 · 7d6e9105
    Arnd Bergmann authored
    An ancient gcc bug (first reported in 2003) has apparently resurfaced
    on MIPS, where kernelci.org reports an overly large stack frame in the
    whirlpool hash algorithm:
    
    crypto/wp512.c:987:1: warning: the frame size of 1112 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    
    With some testing in different configurations, I'm seeing large
    variations in stack frames size up to 1500 bytes for what should have
    around 300 bytes at most. I also checked the reference implementation,
    which is essentially the same code but also comes with some test and
    benchmarking infrastructure.
    
    It seems that recent compiler versions on at least arm, arm64 and powerpc
    have a partial fix for this problem, but enabling "-fsched-pressure", but
    even with that fix they suffer from the issue to a certain degree. Some
    testing on arm64 shows that the time needed to hash a given amount of
    data is roughly proportional to the stack frame size here, which makes
    sense given that the wp512 implementation is doing lots of loads for
    table lookups, and the problem with the overly large stack is a result
    of doing a lot more loads and stores for spilled registers (as seen from
    inspecting the object code).
    
    Disabling -fschedule-insns consistently fixes the problem for wp512,
    in my collection of cross-compilers, the results are consistently better
    or identical when comparing the stack sizes in this function, though
    some architectures (notable x86) have schedule-insns disabled by
    default.
    
    The four columns are:
    default: -O2
    press:	 -O2 -fsched-pressure
    nopress: -O2 -fschedule-insns -fno-sched-pressure
    nosched: -O2 -no-schedule-insns (disables sched-pressure)
    
    				default	press	nopress	nosched
    alpha-linux-gcc-4.9.3		1136	848	1136	176
    am33_2.0-linux-gcc-4.9.3	2100	2076	2100	2104
    arm-linux-gnueabi-gcc-4.9.3	848	848	1048	352
    cris-linux-gcc-4.9.3		272	272	272	272
    frv-linux-gcc-4.9.3		1128	1000	1128	280
    hppa64-linux-gcc-4.9.3		1128	336	1128	184
    hppa-linux-gcc-4.9.3		644	308	644	276
    i386-linux-gcc-4.9.3		352	352	352	352
    m32r-linux-gcc-4.9.3		720	656	720	268
    microblaze-linux-gcc-4.9.3	1108	604	1108	256
    mips64-linux-gcc-4.9.3		1328	592	1328	208
    mips-linux-gcc-4.9.3		1096	624	1096	240
    powerpc64-linux-gcc-4.9.3	1088	432	1088	160
    powerpc-linux-gcc-4.9.3		1080	584	1080	224
    s390-linux-gcc-4.9.3		456	456	624	360
    sh3-linux-gcc-4.9.3		292	292	292	292
    sparc64-linux-gcc-4.9.3		992	240	992	208
    sparc-linux-gcc-4.9.3		680	592	680	312
    x86_64-linux-gcc-4.9.3		224	240	272	224
    xtensa-linux-gcc-4.9.3		1152	704	1152	304
    
    aarch64-linux-gcc-7.0.0		224	224	1104	208
    arm-linux-gnueabi-gcc-7.0.1	824	824	1048	352
    mips-linux-gcc-7.0.0		1120	648	1120	272
    x86_64-linux-gcc-7.0.1		240	240	304	240
    
    arm-linux-gnueabi-gcc-4.4.7	840			392
    arm-linux-gnueabi-gcc-4.5.4	784	728	784	320
    arm-linux-gnueabi-gcc-4.6.4	736	728	736	304
    arm-linux-gnueabi-gcc-4.7.4	944	784	944	352
    arm-linux-gnueabi-gcc-4.8.5	464	464	760	352
    arm-linux-gnueabi-gcc-4.9.3	848	848	1048	352
    arm-linux-gnueabi-gcc-5.3.1	824	824	1064	336
    arm-linux-gnueabi-gcc-6.1.1	808	808	1056	344
    arm-linux-gnueabi-gcc-7.0.1	824	824	1048	352
    
    Trying the same test for serpent-generic, the picture is a bit different,
    and while -fno-schedule-insns is generally better here than the default,
    -fsched-pressure wins overall, so I picked that instead.
    
    				default	press	nopress	nosched
    alpha-linux-gcc-4.9.3		1392	864	1392	960
    am33_2.0-linux-gcc-4.9.3	536	524	536	528
    arm-linux-gnueabi-gcc-4.9.3	552	552	776	536
    cris-linux-gcc-4.9.3		528	528	528	528
    frv-linux-gcc-4.9.3		536	400	536	504
    hppa64-linux-gcc-4.9.3		524	208	524	480
    hppa-linux-gcc-4.9.3		768	472	768	508
    i386-linux-gcc-4.9.3		564	564	564	564
    m32r-linux-gcc-4.9.3		712	576	712	532
    microblaze-linux-gcc-4.9.3	724	392	724	512
    mips64-linux-gcc-4.9.3		720	384	720	496
    mips-linux-gcc-4.9.3		728	384	728	496
    powerpc64-linux-gcc-4.9.3	704	304	704	480
    powerpc-linux-gcc-4.9.3		704	296	704	480
    s390-linux-gcc-4.9.3		560	560	592	536
    sh3-linux-gcc-4.9.3		540	540	540	540
    sparc64-linux-gcc-4.9.3		544	352	544	496
    sparc-linux-gcc-4.9.3		544	344	544	496
    x86_64-linux-gcc-4.9.3		528	536	576	528
    xtensa-linux-gcc-4.9.3		752	544	752	544
    
    aarch64-linux-gcc-7.0.0		432	432	656	480
    arm-linux-gnueabi-gcc-7.0.1	616	616	808	536
    mips-linux-gcc-7.0.0		720	464	720	488
    x86_64-linux-gcc-7.0.1		536	528	600	536
    
    arm-linux-gnueabi-gcc-4.4.7	592			440
    arm-linux-gnueabi-gcc-4.5.4	776	448	776	544
    arm-linux-gnueabi-gcc-4.6.4	776	448	776	544
    arm-linux-gnueabi-gcc-4.7.4	768	448	768	544
    arm-linux-gnueabi-gcc-4.8.5	488	488	776	544
    arm-linux-gnueabi-gcc-4.9.3	552	552	776	536
    arm-linux-gnueabi-gcc-5.3.1	552	552	776	536
    arm-linux-gnueabi-gcc-6.1.1	560	560	776	536
    arm-linux-gnueabi-gcc-7.0.1	616	616	808	536
    
    I did not do any runtime tests with serpent, so it is possible that stack
    frame size does not directly correlate with runtime performance here and
    it actually makes things worse, but it's more likely to help here, and
    the reduced stack frame size is probably enough reason to apply the patch,
    especially given that the crypto code is often used in deep call chains.
    
    Link: https://kernelci.org/build/id/58797d7559b5149efdf6c3a9/logs/
    Link: http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html
    Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11488
    Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149
    Cc: Ralf Baechle <ralf@linux-mips.org>
    Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
    Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
    7d6e9105
Makefile 5.52 KB