    x86: Align jump targets to 1-byte boundaries · be6cb027
    Ingo Molnar authored
    The following NOP in a hot function caught my attention:
    
      >   5a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    
    That's a dead NOP that bloats the function a bit, added for the
    default 16-byte alignment that GCC applies for jump targets.
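
    As a rough illustration of where the 6 filler bytes come from, the
    sketch below computes the padding needed to reach the next 16-byte
    boundary from offset 0x5a. It assumes the aligned jump target
    follows directly after the NOP; the helper name is made up for this
    example and is not part of the patch.

```python
def padding_to_alignment(offset: int, alignment: int) -> int:
    """Return the filler bytes needed to move `offset` up to the next
    multiple of `alignment` (0 if it is already aligned)."""
    return (-offset) % alignment


# Offset of the NOP in the listing above; the target after it is
# assumed to be the 16-byte aligned jump target.
nop_offset = 0x5A
pad = padding_to_alignment(nop_offset, 16)
print(f"padding at {nop_offset:#x}: {pad} bytes, next target at {nop_offset + pad:#x}")
# prints: padding at 0x5a: 6 bytes, next target at 0x60
```

    With an alignment of 1 the same helper always returns 0, which is
    the "no alignment at all" case.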
    
    I realize that x86 CPU manufacturers recommend 16-byte jump
    target alignments (it's in the Intel optimization manual),
    to help their relatively narrow decoder prefetch alignment
    and uop cache constraints, but the cost of that is very
    significant:
    
            text           data       bss         dec      filename
        12566391        1617840   1089536    15273767      vmlinux.align.16-byte
        12224951        1617840   1089536    14932327      vmlinux.align.1-byte
    
    By using 1-byte jump target alignment (i.e. no alignment at all)
    we get an almost 3% reduction in kernel text size (!) - and
    probably a similar reduction in I$ footprint.
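
    The reduction can be checked directly from the text column of the
    size output above. This is only arithmetic on the two quoted
    numbers; the function name is illustrative.

```python
def size_reduction(before: int, after: int) -> tuple[int, float]:
    """Return (bytes saved, percentage saved relative to `before`)."""
    saved = before - after
    return saved, 100.0 * saved / before


# 'text' column from the size output: 16-byte vs. 1-byte alignment.
text_16_byte = 12566391
text_1_byte = 12224951

saved, pct = size_reduction(text_16_byte, text_1_byte)
print(f"text saved: {saved} bytes ({pct:.2f}%)")
# prints: text saved: 341440 bytes (2.72%)
```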
    
    Now, the usual justification for jump target alignment is the
    following:
    
     - modern decoders tend to have 16-byte (effective) decoder
       prefetch windows. (AMD documents a larger window, but measurements
       suggest the effective prefetch window on current uarchs is
       still around 16 bytes.)
    
     - on Intel there's also the uop-cache with cachelines that have
       16-byte granularity and limited associativity.
    
     - older x86 uarchs had a penalty for decoder fetches that crossed
       16-byte boundaries. These limits are mostly gone from recent
       uarchs.
    
    So if a forward jump target is aligned to a cacheline boundary then
    prefetches will start from a new prefetch-cacheline and there's
    a higher chance of decoding in fewer steps and packing tightly.
    
    But I think that argument is flawed for typical optimized kernel
    code flows: forward jumps often go to 'cold' (uncommon) pieces
    of code, and aligning cold code to cache lines does not bring a
    lot of advantages (it is rarely executed), while it causes
    collateral damage:
    
     - their alignment 'spreads out' the cache footprint and shifts
       follow-up hot code further out
    
     - plus it slows down even 'cold' code that immediately follows 'hot'
       code (like in the above case), which could have benefited from the
       partial cacheline that comes off the end of hot code.
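
    The 'spreading out' effect can be sketched with a toy layout
    model. Everything here is hypothetical: the block sizes are
    invented, every block is treated as a jump target (GCC only aligns
    some of them), and a 64-byte cache line is assumed. It shows the
    mechanism only and is not a measurement.

```python
def layout_size(block_sizes: list[int], alignment: int) -> int:
    """Lay blocks out back to back, padding each block's start up to
    `alignment`, and return the total bytes occupied."""
    offset = 0
    for size in block_sizes:
        offset += (-offset) % alignment  # filler before this block
        offset += size
    return offset


def cache_lines(total_bytes: int, line_size: int = 64) -> int:
    """Cache lines touched, assuming the code starts on a line boundary."""
    return -(-total_bytes // line_size)  # ceiling division


# Hypothetical basic-block sizes in bytes (made up for illustration).
blocks = [10, 7, 22, 5]

for align in (16, 1):
    total = layout_size(blocks, align)
    print(f"align={align}: {total} bytes, {cache_lines(total)} cache line(s)")
# prints:
# align=16: 69 bytes, 2 cache line(s)
# align=1: 44 bytes, 1 cache line(s)
```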
    
    But even in the cache-hot case the 16 byte alignment brings
    disadvantages:
    
     - it spreads out the cache footprint, possibly making the code
       fall out of the L1 I$.
    
     - On Intel CPUs, recent microarchitectures have plenty of
       uop cache (typically doubling every 3 years) - while the
       size of the L1 cache grows much less aggressively. So
       workloads are rarely uop cache limited.
    
    The only situation where alignment might matter is tight
    loops that could fit into a single 16-byte chunk - but those
    are pretty rare in the kernel: if they exist they tend
    to be pointer chasing or generic memory ops, which both tend
    to be cache miss (or cache allocation) intensive and are not
    decoder bandwidth limited.
    
    So the balance of arguments strongly favors packing kernel
    instructions tightly versus maximizing for decoder bandwidth:
    this patch changes the jump target alignment from 16 bytes
    to 1 byte (tightly packed, unaligned).
    Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Aswin Chandramouleeswaran <aswin@hp.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Jason Low <jason.low2@hp.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Link: http://lkml.kernel.org/r/20150410120846.GA17101@gmail.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>