    perf/x86/amd: Update generic hardware cache events for Family 17h · 0e3b74e2
    Kim Phillips authored
    Add a new amd_hw_cache_event_ids_f17h assignment structure set
    for AMD Families 17h and above, since many of the underlying PMC
    event definitions changed with Family 17h.  Specifically:
    
    L1 Data Cache
    
    The data cache access counter remains the same on Family 17h.
    
    For DC misses, PMCx041's definition changes with Family 17h,
    so instead we use the L2 cache accesses from L1 data cache
    misses counter (PMCx060,umask=0xc8).
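
    As an illustration of how an event/umask pair such as PMCx060 with
    umask=0xc8 becomes a single raw config value, here is a minimal
    sketch.  It assumes the usual AMD PERF_CTL layout (EventSelect[7:0]
    in bits 7:0, UnitMask in bits 15:8, EventSelect[11:8] in bits 35:32);
    amd_raw_config() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack an AMD core PMC event select and unit mask into a raw config value.
 * Assumed layout: EventSelect[7:0] -> bits 7:0, UnitMask -> bits 15:8,
 * EventSelect[11:8] -> bits 35:32.
 * Hypothetical helper for illustration only.
 */
static uint64_t amd_raw_config(unsigned int event, unsigned int umask)
{
    assert(event <= 0xfff);   /* event select is 12 bits wide */
    assert(umask <= 0xff);    /* unit mask is 8 bits wide */

    return (uint64_t)(event & 0xff) |
           ((uint64_t)umask << 8) |
           ((uint64_t)(event >> 8) << 32);
}
```

    With these assumptions, amd_raw_config(0x060, 0xc8) yields 0xc860.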
    
    For DC hardware prefetch events, Family 17h breaks compatibility
    for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware
    Prefetch DC Fills."
    
    L1 Instruction Cache
    
    PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward
    compatible on Family 17h.
    
    For prefetches, we remove the erroneous PMCx04B assignment, which
    counts how many software data cache prefetch load instructions were
    dispatched.
    
    LL - Last Level Cache
    
    Remove the PMCx07D, PMCx07E, and PMCx07F assignments, as they do
    not exist on Family 17h, where the last level cache is L3.  L3
    counters can be accessed using the existing AMD Uncore driver.
    
    Data TLB
    
    On Intel machines, data TLB accesses ("dTLB-loads") are assigned
    to counters that count load/store instructions retired.  This
    is inconsistent with instruction TLB accesses, where Intel
    implementations report iTLB misses that hit in the STLB.
    
    Ideally, dTLB-loads would count higher level dTLB misses that hit
    in lower level TLBs, and dTLB-load-misses would report those
    that also missed in those lower-level TLBs, therefore causing
    a page table walk.  That would be consistent with instruction
    TLB operation, remove the redundancy between dTLB-loads and
    L1-dcache-loads, and prevent perf from producing artificially
    low percentage ratios, i.e. the "0.01%" below:
    
            42,550,869      L1-dcache-loads
            41,591,860      dTLB-loads
                 4,802      dTLB-load-misses          #    0.01% of all dTLB cache hits
             7,283,682      L1-dcache-stores
             7,912,392      dTLB-stores
                   310      dTLB-store-misses
    
    On AMD Families prior to 17h, the "Data Cache Accesses" counter is
    used, which is slightly better than load/store instructions retired,
    but still counts in terms of individual load/store operations
    instead of TLB operations.
    
    So, for AMD Families 17h and higher, this patch assigns "dTLB-loads"
    to a counter for L1 dTLB misses that hit in the L2 dTLB, and
    "dTLB-load-misses" to a counter for L1 dTLB misses that caused
    L2 dTLB misses and therefore also caused page table walks.  This
    results in a much more accurate view of data TLB performance:
    
            60,961,781      L1-dcache-loads
                 4,601      dTLB-loads
                   963      dTLB-load-misses          #   20.93% of all dTLB cache hits
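
    The percentages in the two listings follow directly from the
    misses/accesses ratio.  A minimal sketch of that arithmetic,
    where miss_pct() is a hypothetical helper and not part of perf:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Miss ratio in percent: 100 * misses / accesses.
 * Hypothetical helper for illustration only.
 */
static double miss_pct(uint64_t misses, uint64_t accesses)
{
    assert(accesses > 0);   /* avoid dividing by zero */

    return 100.0 * (double)misses / (double)accesses;
}
```

    miss_pct(4802, 41591860) is roughly 0.0115, which is displayed as
    "0.01%", whereas miss_pct(963, 4601) is roughly 20.93.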
    
    Note that for all AMD families, data loads and stores are combined
    in a single accesses counter, so no 'L1-dcache-stores' are reported
    separately, and stores are counted with loads in 'L1-dcache-loads'.
    
    Also note that the "% of all dTLB cache hits" string is misleading,
    for two reasons: (a) although TLBs can be considered caches for page
    tables, "dTLB cache" can be misread here as data cache hits because
    the figures are similar (at least on Intel), and (b) not all of
    those loads (more precisely, accesses) actually "hit" at that
    hardware level.  "% of all dTLB accesses" would be clearer and more
    accurate.
    
    Instruction TLB
    
    On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the
    STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in
    the STLB and completed a page table walk.
    
    For AMD Family 17h and above, for 'iTLB-loads' we replace the
    erroneous instruction cache fetches counter with PMCx084
    "L1 ITLB Miss, L2 ITLB Hit".
    
    For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss,
    L2 ITLB Miss", but set a 0xff umask because without it the event
    does not get counted.
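
    For reference, user space does not name PMCx084/PMCx085 directly;
    it requests a generic cache event and the kernel maps it through
    the per-family table.  A minimal sketch of how such a request is
    described, using the PERF_TYPE_HW_CACHE config encoding from
    <linux/perf_event.h> (hw_cache_attr() is a hypothetical helper,
    and the sketch only fills in the attribute, it does not open a
    counter):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/perf_event.h>

/*
 * Describe a generic hardware cache event.  The config encoding is
 * cache id | (op id << 8) | (result id << 16).
 * Hypothetical helper for illustration only.
 */
static void hw_cache_attr(struct perf_event_attr *attr, unsigned int cache,
                          unsigned int op, unsigned int result)
{
    assert(attr != NULL);

    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HW_CACHE;
    attr->size = sizeof(*attr);
    attr->config = cache | (op << 8) | (result << 16);
}
```

    'iTLB-load-misses' corresponds to PERF_COUNT_HW_CACHE_ITLB,
    PERF_COUNT_HW_CACHE_OP_READ and PERF_COUNT_HW_CACHE_RESULT_MISS.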
    
    Branch Predictor (BPU)
    
    PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families.
    
    Node Level Events
    
    Family 17h does not have a PMCx0e9 counter, and corresponding counters
    have not been made available publicly, so for now, we mark them as
    unsupported for Families 17h and above.
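
    A minimal sketch of what "unsupported" means for a table entry.
    It assumes the convention used by the x86 perf core, where an
    entry of 0 means "no such event" and -1 means "not supported on
    this hardware"; the names below are hypothetical, cut-down
    stand-ins for the real table and lookup.

```c
#include <assert.h>
#include <errno.h>

enum { RES_ACCESS, RES_MISS, RES_MAX };

/* Hypothetical one-row stand-in for the node-level read entries. */
static const long long node_read_f17h[RES_MAX] = {
    [RES_ACCESS] = -1,   /* marked unsupported */
    [RES_MISS]   = -1,   /* marked unsupported */
};

/*
 * Translate one table entry into a raw config.
 * Returns 0 on success, -ENOENT for an empty entry (0),
 * and -EINVAL for an entry marked unsupported (-1).
 */
static int lookup_event(const long long *row, int result,
                        unsigned long long *config)
{
    long long val;

    assert(result >= 0 && result < RES_MAX);

    val = row[result];
    if (val == 0)
        return -ENOENT;
    if (val == -1)
        return -EINVAL;

    *config = (unsigned long long)val;
    return 0;
}
```

    Under this convention, a request for a node-level event on
    Family 17h fails with -EINVAL instead of silently counting an
    unrelated event.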
    
    Reference:
    
      "Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh"
      Released 7/17/2018, Publication #56255, Revision 3.03:
      https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf
    
    [ mingo: tidied up the line breaks. ]
    Signed-off-by: Kim Phillips <kim.phillips@amd.com>
    Cc: <stable@vger.kernel.org> # v4.9+
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Martin Liška <mliska@suse.cz>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Pu Wen <puwen@hygon.cn>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-perf-users@vger.kernel.org
    Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
    Signed-off-by: Ingo Molnar <mingo@kernel.org>