    Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 · 9678cdaa
    Stewart Smith authored
    The POWER8 processor has a Micro Partition Prefetch Engine, which is
    a fancy way of saying "has a way to store and load contents of L2 or
    L2+MRU way of L3 cache". We initiate the storing of the log (list of
    addresses) using the logmpp instruction and start restore by writing
    to a SPR.
    
    The logmpp instruction takes parameters in a single 64-bit register:
    - starting address of the table to store log of L2/L2+L3 cache contents
      - 32KB for L2
      - 128KB for L2+L3
      - Aligned relative to maximum size of the table (32KB or 128KB)
    - Log control (no-op, L2 only, L2 and L3, abort logout)
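The operand layout listed above can be sketched in C roughly as follows. This is only an illustration: the table sizes and the alignment rule come from the description above, while the field position (`HYP_LOG_CTL_SHIFT`), the `HYP_LOG_*` encodings and the helper names are hypothetical placeholders, not values taken from the ISA or from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Table sizes as stated in the commit message. */
#define MPP_TABLE_SIZE_L2    (32u * 1024u)   /* L2 only */
#define MPP_TABLE_SIZE_L2L3  (128u * 1024u)  /* L2 + L3 */

/* ASSUMPTION: hypothetical field position and encodings, illustration only. */
#define HYP_LOG_CTL_SHIFT  54
enum hyp_log_ctl {
    HYP_LOG_NOOP  = 0,
    HYP_LOG_L2    = 1,
    HYP_LOG_L2L3  = 2,
    HYP_LOG_ABORT = 3,
};

/* Returns nonzero if addr is aligned to table_size (a power of two). */
static int mpp_table_aligned(uint64_t addr, uint64_t table_size)
{
    return (addr & (table_size - 1)) == 0;
}

/* Packs the table address and the log control into one 64-bit operand. */
static uint64_t hyp_logmpp_operand(uint64_t table_addr, uint64_t table_size,
                                   enum hyp_log_ctl ctl)
{
    assert(mpp_table_aligned(table_addr, table_size));
    return table_addr | ((uint64_t)ctl << HYP_LOG_CTL_SHIFT);
}
```

For example, `hyp_logmpp_operand(addr, MPP_TABLE_SIZE_L2, HYP_LOG_L2)` would build the operand for an L2-only log, provided `addr` is 32KB aligned.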
    
    We should abort any ongoing logging before initiating a new one.
    
    To initiate restore, we write to the MPPR SPR. The format of what to write
    to the SPR is similar to the logmpp instruction parameter:
    - starting address of the table to read from (same alignment requirements)
    - table size (no data, until end of table)
    - prefetch rate (from fastest possible to slower: about every 8, 16, 24 or
      32 cycles)
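A value for the MPPR SPR could be assembled along the same lines. Again this is a sketch under stated assumptions: the three fields and the four approximate prefetch intervals come from the list above, but the bit positions (`HYP_MPPR_SIZE_SHIFT`, `HYP_MPPR_RATE_SHIFT`), the encodings and the helper names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* ASSUMPTION: hypothetical bit positions, illustration only. */
#define HYP_MPPR_SIZE_SHIFT 60
#define HYP_MPPR_RATE_SHIFT 56

/* ASSUMPTION: hypothetical encodings for the "table size" field. */
enum hyp_mppr_size {
    HYP_MPPR_NO_DATA     = 0,  /* fetch nothing */
    HYP_MPPR_WHOLE_TABLE = 1,  /* fetch until end of table */
};

/*
 * Maps an approximate interval of 8, 16, 24 or 32 cycles to a 2-bit
 * index (0 = fastest). Returns -1 for any other value.
 */
static int hyp_mppr_rate_index(unsigned cycles)
{
    if (cycles < 8 || cycles > 32 || cycles % 8 != 0)
        return -1;
    return (int)(cycles / 8) - 1;
}

/* Packs table address, table size selector and prefetch rate. */
static uint64_t hyp_mppr_value(uint64_t table_addr, enum hyp_mppr_size size,
                               unsigned rate_index)
{
    return table_addr
         | ((uint64_t)size << HYP_MPPR_SIZE_SHIFT)
         | ((uint64_t)(rate_index & 0x3u) << HYP_MPPR_RATE_SHIFT);
}
```

The table address passed in is subject to the same alignment requirement as the logmpp operand.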
    
    The idea behind loading and storing the contents of L2/L3 cache is to
    reduce memory latency in a system that is frequently swapping vcores on
    a physical CPU.
    
    The best case scenario for doing this is when some vcores are doing very
    cache heavy workloads. The worst case is when they have about 0 cache hits,
    so we just generate needless memory operations.
    
    This implementation just does L2 store/load. In my benchmarks this proves
    to be useful.
    
    Benchmark 1:
     - 16 core POWER8
     - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
     - No split core/SMT
     - two guests running sysbench memory test.
       sysbench --test=memory --num-threads=8 run
     - one guest running apache bench (of default HTML page)
       ab -n 490000 -c 400 http://localhost/
    
    This benchmark aims to measure performance of a real-world application (apache)
    where other guests are cache hot with their own workloads. The sysbench memory
    benchmark does pointer sized writes to a (small) memory buffer in a loop.
    
    In this benchmark with this patch I can see an improvement both in requests
    per second (~5%) and in mean and median response times (again, about 5%).
    The spread of minimum and maximum response times was largely unchanged.
    
    Benchmark 2:
     - Same VM config as benchmark 1
     - all three guests running sysbench memory benchmark
    
    This benchmark aims to see whether there is a positive or negative effect on
    this cache-heavy benchmark. Due to the nature of the benchmark (stores), we
    may not see a difference in raw performance, but we hope for an improvement
    in consistency of performance (when a vcore is switched in, it doesn't have
    to wait many times for cachelines to be pulled in).
    
    The results of this benchmark are improvements in consistency of performance
    rather than performance itself. With this patch, the few outliers in duration
    go away and we get more consistent performance in each guest.
    
    Benchmark 3:
     - same 3 guests and CPU configuration as benchmark 1 and 2.
     - two idle guests
     - 1 guest running STREAM benchmark
    
    This scenario also saw performance improvement with this patch. On Copy and
    Scale workloads from STREAM, I got 5-6% improvement with this patch. For
    Add and Triad, it was around 10% (or more).
    
    Benchmark 4:
     - same 3 guests as previous benchmarks
     - two guests running the sysbench memory test, a distinctly different
       cache-heavy workload
     - one guest running STREAM benchmark.
    
    Similar improvements to benchmark 3.
    
    Benchmark 5:
     - 1 guest, 8 VCPUs, Ubuntu 14.04
     - Host configured with split core (SMT8, subcores-per-core=4)
     - STREAM benchmark
    
    In this benchmark, we see a 10-20% performance improvement across the board
    of STREAM benchmark results with this patch.
    
    Based on preliminary investigation and microbenchmarks
    by Prerna Saxena <prerna@linux.vnet.ibm.com>
    Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
    Acked-by: Paul Mackerras <paulus@samba.org>
    Signed-off-by: Alexander Graf <agraf@suse.de>
cache.h 2.01 KB