    powerpc/vdso: Avoid link stack corruption in __get_datapage() · c974809a
    Michael Neuling authored
    powerpc has a link register (lr) used for calling functions. We "bl
    <func>" to call a function, and "blr" to return back to the call site.
    
    The lr is only a single register, so if we call another function from
    inside this function (ie. nested calls), software must save away the
    lr on the software stack before calling the new function. Before
    returning (ie. before the "blr"), the lr is restored by software from
    the software stack.
    
    This makes branch prediction quite difficult for the processor as it
    will only know the branch target just before the "blr".
    
    To help with this, modern powerpc processors keep a (non-architected)
    hardware stack of lr called a "link stack". When a "bl <func>" is
    run, the lr is pushed onto this stack. When a "blr" is run, the
    branch predictor pops the lr value from the top of the link stack and
    uses it to predict the branch target. Hence the processor pipeline
    knows the branch target a lot earlier.
    
    This works great, but there are some cases where you call "bl"
    without a matching "blr". One such case is when trying to determine
    the program counter (which can't be read directly). Here you "bl+4;
    mflr" to get the program counter. If you do this, the link stack will
    get out of sync with reality, causing the branch predictor to
    mis-predict subsequent function returns.
    
    To avoid this, modern micro-architectures have a special case of bl.
    Using the form "bcl 20,31,+4" ensures the processor doesn't push to
    the link stack.
    
    The 32 and 64 bit variants of __get_datapage() use a "bl; mflr" to
    determine the loaded address of the VDSO. The current versions of
    these attempt to use this special bl variant.
    
    Unfortunately they use +8 rather than the required +4. Hence the
    current code still pushes to the link stack, leaving it out of sync
    with reality and degrading performance.
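
    The effect can be illustrated with a toy software model of a link
    stack. This is only a sketch: the real structure is non-architected,
    so the depth, the helper names (ls_push, ls_bcl, ls_return,
    run_chain), the addresses and the lack of any recovery after a
    mispredict are all assumptions made for illustration. The model
    treats "bcl 20,31,+4" as the only form that skips the push, as
    described above.

```c
#include <assert.h>

#define LINK_STACK_DEPTH 16	/* assumed depth, not a real hardware value */

/* Toy link stack: predicted return addresses plus a mispredict counter. */
struct link_stack {
	unsigned long entries[LINK_STACK_DEPTH];
	int top;		/* number of valid entries */
	int mispredicts;	/* returns whose prediction was wrong */
};

/* "bl <func>": push the return address (dropped if the stack is full). */
static void ls_push(struct link_stack *ls, unsigned long ret_addr)
{
	if (ls->top < LINK_STACK_DEPTH)
		ls->entries[ls->top++] = ret_addr;
}

/* "bcl 20,31,<offset>": in this model only offset +4 skips the push. */
static void ls_bcl(struct link_stack *ls, unsigned long pc, long offset)
{
	if (offset != 4)
		ls_push(ls, pc + 4);
}

/* "blr": pop the prediction and compare it with the real target in lr. */
static void ls_return(struct link_stack *ls, unsigned long actual)
{
	unsigned long predicted = ls->top > 0 ? ls->entries[--ls->top] : 0;

	if (predicted != actual)
		ls->mispredicts++;
}

/*
 * Make `depth` nested calls, read the program counter in the innermost
 * function with "bcl 20,31,<bcl_offset>; mflr", then return from every
 * level. Returns the number of mispredicted returns.
 */
int run_chain(int depth, long bcl_offset)
{
	struct link_stack ls = { .top = 0, .mispredicts = 0 };
	unsigned long ret[LINK_STACK_DEPTH];
	int i;

	assert(depth > 0 && depth < LINK_STACK_DEPTH);

	for (i = 0; i < depth; i++) {
		ret[i] = 0x1000 + 0x100 * i + 4;	/* made-up call sites */
		ls_push(&ls, ret[i]);
	}

	ls_bcl(&ls, 0x9000, bcl_offset);		/* made-up pc */

	for (i = depth - 1; i >= 0; i--)
		ls_return(&ls, ret[i]);

	return ls.mispredicts;
}
```

    In this model run_chain(3, 4) reports no mispredicted returns, while
    run_chain(3, 8) mispredicts all three, because the extra entry shifts
    every later prediction by one slot.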
    
    This patch switches to bcl+4 by moving __kernel_datapage_offset out
    of __get_datapage().
    
    With this patch, a gettimeofday() microbenchmark (gettimeofday()
    uses __get_datapage()) shows a decent performance improvement on
    POWER7/8.
    
    For the benchmark in tools/testing/selftests/powerpc/benchmarks/gettimeofday.c
      POWER8:
        64bit gets ~4% improvement
        32bit gets ~9% improvement
      POWER7:
        64bit gets ~7% improvement

    Signed-off-by: Michael Neuling <mikey@neuling.org>
    Reported-by: Aaron Sawdey <sawdey@us.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
datapage.S 2.03 KB