1. 04 Nov, 2015 3 commits
    • Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b02ac6b1
      Linus Torvalds authored
      Pull perf updates from Ingo Molnar:
       "Kernel side changes:
      
         - Improve accuracy of perf/sched clock on x86.  (Adrian Hunter)
      
         - Intel DS and BTS updates.  (Alexander Shishkin)
      
         - Intel cstate PMU support.  (Kan Liang)
      
         - Add group read support to perf_event_read().  (Peter Zijlstra)
      
         - Branch call hardware sampling support, implemented on x86 and
           PowerPC.  (Stephane Eranian)
      
         - Event groups transactional interface enhancements.  (Sukadev
           Bhattiprolu)
      
         - Enable proper x86/intel/uncore PMU support on multi-segment PCI
           systems.  (Taku Izumi)
      
         - ... misc fixes and cleanups.
      
        The perf tooling team was very busy again with 200+ commits, the full
        diff doesn't fit into lkml size limits.  Here's an (incomplete) list
        of the tooling highlights:
      
        New features:
      
         - Change the default event used in all tools (record/top): use the
           most precise "cycles" hw counter available, i.e. when the user
           doesn't specify any event, it will try using cycles:ppp, cycles:pp,
           etc and fall back transparently until it finds a working counter.
           (Arnaldo Carvalho de Melo)
      
         - Integration of perf with eBPF that, given an eBPF .c source file
           (or .o file built for the 'bpf' target with clang), will get it
           automatically built, validated and loaded into the kernel via the
           sys_bpf syscall, which can then be used and seen using 'perf trace'
           and other tools.
      
           (Wang Nan)
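The default-event fallback described in the first item above can be sketched roughly as follows. This is a minimal illustration, not perf's actual implementation: the `event_probe_fn` callback, `pick_default_event()`, and the two sample probes are hypothetical names, and the tail of the candidate list (`cycles:p`, `cycles`) is an assumption about what "etc" covers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical probe: returns nonzero if the named event can be opened. */
typedef int (*event_probe_fn)(const char *event);

/* Candidates from most to least precise. The first two come from the text;
 * the last two are an assumption about the rest of the fallback chain. */
static const char *const cycles_candidates[] = {
    "cycles:ppp", "cycles:pp", "cycles:p", "cycles",
};

/* Return the first candidate the probe accepts, or NULL if none works. */
const char *pick_default_event(event_probe_fn probe)
{
    size_t n = sizeof(cycles_candidates) / sizeof(cycles_candidates[0]);

    for (size_t i = 0; i < n; i++)
        if (probe(cycles_candidates[i]))
            return cycles_candidates[i];
    return NULL;
}

/* Sample probe simulating hardware that supports at most precise level 2.
 * It counts the characters after ':', which is only valid for the
 * all-'p' modifiers used in the candidate list above. */
int probe_precise_up_to_2(const char *event)
{
    const char *modifier = strchr(event, ':');
    size_t level = modifier ? strlen(modifier + 1) : 0;

    return level <= 2;
}

/* Sample probe simulating a system with no usable cycles counter. */
int probe_nothing(const char *event)
{
    (void)event;
    return 0;
}

/* Walk-through: the most precise working candidate wins. */
void demo_pick_default_event(void)
{
    assert(strcmp(pick_default_event(probe_precise_up_to_2), "cycles:pp") == 0);
    assert(pick_default_event(probe_nothing) == NULL);
}
```

The ordered table keeps the policy (most precise first) separate from the mechanism (try, then fall back), so the user never has to know which precision level the hardware supports.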
      
        Various user interface improvements:
      
         - Automatic pager invocation on long help output.  (Namhyung Kim)
      
         - Search for more options when passing args to -h, e.g.: (Arnaldo
           Carvalho de Melo)
      
              $ perf report -h interface
      
              Usage: perf report [<options>]
      
               --gtk    Use the GTK2 interface
               --stdio  Use the stdio interface
               --tui    Use the TUI interface
      
         - Show ordered command line options when -h is used or when an
           unknown option is specified.  (Arnaldo Carvalho de Melo)
      
         - If options are passed after -h, show just its descriptions, not all
           options.  (Arnaldo Carvalho de Melo)
      
         - Implement column based horizontal scrolling in the hists browser
           (top, report), making it possible to use the TUI for things like
           'perf mem report' where there are many more columns than can fit in
           a terminal.  (Arnaldo Carvalho de Melo)
      
         - Enhance the error reporting of tracepoint event parsing, e.g.:
      
             $ oldperf record -e sched:sched_switc usleep 1
             event syntax error: 'sched:sched_switc'
                                  \___ unknown tracepoint
             Run 'perf list' for a list of valid events
      
           Now we get the much nicer:
      
             $ perf record -e sched:sched_switc ls
             event syntax error: 'sched:sched_switc'
                                  \___ can't access trace events
      
             Error: No permissions to read /sys/kernel/debug/tracing/events/sched/sched_switc
             Hint:  Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'
      
           And after we have those mount point permissions fixed:
      
             $ perf record -e sched:sched_switc ls
             event syntax error: 'sched:sched_switc'
                                  \___ unknown tracepoint
      
             Error: File /sys/kernel/debug/tracing/events/sched/sched_switc not found.
             Hint:  Perhaps this kernel misses some CONFIG_ setting to enable this feature?.
      
           I.e.  basically now the event parsing routine uses the strerror_open()
           routines introduced by and used in 'perf trace' work.  (Jiri Olsa)
      
         - Fail properly when pattern matching fails to find a tracepoint,
           i.e. '-e non:existent' was being correctly handled, with a proper
           error message about that not being a valid event, but '-e
           non:existent*' wasn't, fix it.  (Jiri Olsa)
      
         - Do event name substring search as last resort in 'perf list'.
           (Arnaldo Carvalho de Melo)
      
           E.g.:
      
             # perf list clock
      
             List of pre-defined events (to be used in -e):
      
              cpu-clock                                          [Software event]
              task-clock                                         [Software event]
      
              uncore_cbox_0/clockticks/                          [Kernel PMU event]
              uncore_cbox_1/clockticks/                          [Kernel PMU event]
      
              kvm:kvm_pvclock_update                             [Tracepoint event]
              kvm:kvm_update_master_clock                        [Tracepoint event]
              power:clock_disable                                [Tracepoint event]
              power:clock_enable                                 [Tracepoint event]
              power:clock_set_rate                               [Tracepoint event]
              syscalls:sys_enter_clock_adjtime                   [Tracepoint event]
              syscalls:sys_enter_clock_getres                    [Tracepoint event]
              syscalls:sys_enter_clock_gettime                   [Tracepoint event]
              syscalls:sys_enter_clock_nanosleep                 [Tracepoint event]
              syscalls:sys_enter_clock_settime                   [Tracepoint event]
              syscalls:sys_exit_clock_adjtime                    [Tracepoint event]
              syscalls:sys_exit_clock_getres                     [Tracepoint event]
              syscalls:sys_exit_clock_gettime                    [Tracepoint event]
              syscalls:sys_exit_clock_nanosleep                  [Tracepoint event]
              syscalls:sys_exit_clock_settime                    [Tracepoint event]
      
        Intel PT hardware tracing enhancements:
      
         - Accept a zero --itrace period, meaning "as often as possible".  In
           the case of Intel PT that is the same as a period of 1 and a unit
           of 'instructions' (i.e.  --itrace=i1i).  (Adrian Hunter)
      
         - Harmonize itrace's synthesized callchains with the existing
           --max-stack tool option.  (Adrian Hunter)
      
         - Allow time to be displayed in nanoseconds in 'perf script'.
           (Adrian Hunter)
      
         - Fix potential infinite loop when handling Intel PT timestamps.
           (Adrian Hunter)
      
         - Slightly improve Intel PT debug logging.  (Adrian Hunter)
      
         - Warn when AUX data has been lost, just like when processing
           PERF_RECORD_LOST.  (Adrian Hunter)
      
         - Further document export-to-postgresql.py script.  (Adrian Hunter)
      
         - Add option to synthesize branch stack from auxtrace data.  (Adrian
           Hunter)
      
        Misc notable changes:
      
         - Switch the default callchain output mode to 'graph,0.5,caller', to
           make it look like the default for other tools, reducing the
           learning curve for people used to 'caller' based viewing.  (Arnaldo
           Carvalho de Melo)
      
         - Various call chain usability enhancements.  (Namhyung Kim)
      
         - Introduce the 'P' event modifier, meaning 'max precision level,
           please', i.e.:
      
              $ perf record -e cycles:P usleep 1
      
           Is now similar to:
      
              $ perf record usleep 1
      
           Useful, for instance, when specifying multiple events.  (Jiri Olsa)
      
         - Add 'socket' sort entry, to sort by the processor socket in 'perf
           top' and 'perf report'.  (Kan Liang)
      
         - Introduce --socket-filter to 'perf report', for filtering by
           processor socket.  (Kan Liang)
      
         - Add new "Zoom into Processor Socket" operation in the perf hists
           browser, used in 'perf top' and 'perf report'.  (Kan Liang)
      
         - Allow probing on kmodules without DWARF.  (Masami Hiramatsu)
      
         - Fix 'perf probe -l' for probes added to kernel module functions.
           (Masami Hiramatsu)
      
         - Preparatory work for the 'perf stat record' feature that will allow
           generating perf.data files with counting data in addition to the
           sampling mode we have now.  (Jiri Olsa)
      
         - Update libtraceevent KVM plugin.  (Paolo Bonzini)
      
         - ... plus lots of other enhancements that I failed to list properly,
           by: Adrian Hunter, Alexander Shishkin, Andi Kleen, Andrzej Hajda,
           Arnaldo Carvalho de Melo, Dima Kogan, Don Zickus, Geliang Tang, He
           Kuang, Huaitong Han, Ingo Molnar, Jan Stancek, Jiri Olsa, Kan
           Liang, Kirill Tkhai, Masami Hiramatsu, Matt Fleming, Namhyung Kim,
           Paolo Bonzini, Peter Zijlstra, Rabin Vincent, Scott Wood, Stephane
           Eranian, Sukadev Bhattiprolu, Taku Izumi, Vaishali Thakkar, Wang
           Nan, Yang Shi and Yunlong Song"
      
      * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (260 commits)
        perf unwind: Pass symbol source to libunwind
        tools build: Fix libiberty feature detection
        perf tools: Compile scriptlets to BPF objects when passing '.c' to --event
        perf record: Add clang options for compiling BPF scripts
        perf bpf: Attach eBPF filter to perf event
        perf tools: Make sure fixdep is built before libbpf
        perf script: Enable printing of branch stack
        perf trace: Add cmd string table to decode sys_bpf first arg
        perf bpf: Collect perf_evsel in BPF object files
        perf tools: Load eBPF object into kernel
        perf tools: Create probe points for BPF programs
        perf tools: Enable passing bpf object file to --event
        perf ebpf: Add the libbpf glue
        perf tools: Make perf depend on libbpf
        perf symbols: Fix endless loop in dso__split_kallsyms_for_kcore
        perf tools: Enable pre-event inherit setting by config terms
        perf symbols: we can now read separate debug-info files based on a build ID
        perf symbols: Fix type error when reading a build-id
        perf tools: Search for more options when passing args to -h
        perf stat: Cache aggregated map entries in extra cpumap
        ...
    • atomic: remove all traces of READ_ONCE_CTRL() and atomic*_read_ctrl() · 105ff3cb
      Linus Torvalds authored
      This seems to be a mis-reading of how alpha memory ordering works, and
      is not backed up by the alpha architecture manual.  The helper functions
      don't do anything special on any other architectures, and the arguments
      that support them being safe on other architectures also argue that they
      are safe on alpha.
      
      Basically, the "control dependency" is between a previous read and a
      subsequent write that is dependent on the value read.  Even if the
      subsequent write is actually done speculatively, there is no way that
      such a speculative write could be made visible to other cpus until it
      has been committed, which requires validating the speculation.
      
      Note that most weakly ordered architectures (very much including alpha)
      do not guarantee any ordering relationship between two loads that depend
      on each other on a control dependency:
      
          read A
          if (A == 1)
              read B
      
      because the conditional may be predicted, and the "read B" may be
      speculatively moved up to before reading the value A.  So we require the
      user to insert a smp_rmb() between the two accesses to be correct:
      
          read A
          if (A == 1) {
              smp_rmb()
              read B
          }
      
      Alpha is further special in that it can break that ordering even if the
      *address* of B depends on the read of A, because the cacheline that is
      read later may be stale unless you have a memory barrier in between the
      pointer read and the read of the value behind a pointer:
      
          read ptr
          read offset(ptr)
      
      whereas all other weakly ordered architectures guarantee that the data
      dependency (as opposed to just a control dependency) will order the two
      accesses.  As a result, alpha needs a "smp_read_barrier_depends()" in
      between those two reads for them to be ordered.
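As a rough userspace analogue of the "read A; if (A == 1) smp_rmb(); read B" pattern above, the following sketch uses C11 fences in place of the kernel barriers. It is an illustration under stated assumptions, not kernel code: `payload`, `ready`, `publish()` and `try_consume()` are hypothetical names, the acquire fence stands in for smp_rmb(), and the release fence stands in for the writer-side barrier that the text does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state for illustration only. */
static int payload;       /* "B": plain data published by the writer */
static atomic_int ready;  /* "A": flag guarding the payload, starts at 0 */

/* Writer: store the data, then set the flag. The release fence keeps the
 * payload store from being reordered after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: read A, and only if it is set, fence and then read B.
 * Without the fence, the read of B could be satisfied speculatively
 * before the read of A, because a control dependency alone does not
 * order two loads. */
bool try_consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 1) {
        atomic_thread_fence(memory_order_acquire);
        *out = payload;
        return true;
    }
    return false;
}

/* Single-threaded walk-through of the protocol. */
void demo_publish_consume(void)
{
    int value = 0;

    publish(7);
    assert(try_consume(&value));
    assert(value == 7);
}
```

In a real program `publish()` and `try_consume()` would run on different threads; the single-threaded walk-through only shows the call sequence, not the reordering the fences prevent.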
      
      The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()"
      had was a control dependency to a subsequent *write*, however, and
      nobody can finalize such a subsequent write without having actually done
      the read.  And were you to write such a value to a "stale" cacheline
      (the way the unordered reads came to be), that would seem to lose the
      write entirely.
      
      So the things that make alpha able to re-order reads even more
      aggressively than other weak architectures do not seem to be relevant
      for a subsequent write.  Alpha memory ordering may be strange, but
      there's no real indication that it is *that* strange.
      
      Also, the alpha architecture reference manual very explicitly talks
      about the definition of "Dependence Constraints" in section 5.6.1.7,
      where a preceding read dominates a subsequent write.
      
      Such a dependence constraint admittedly does not impose a BEFORE (alpha
      architecture term for globally visible ordering), but it does guarantee
      that there can be no "causal loop".  I don't see how you could avoid
      such a loop if another cpu could see the stored value and then impact
      the value of the first read.  Put another way: the read and the write
      could not be seen as being out of order wrt other cpus.
      
      So I do not see how these "x_ctrl()" functions can currently be necessary.
      
      I may have to eat my words at some point, but in the absence of clear
      proof that alpha actually needs this, or indeed even an explanation of
      how alpha could _possibly_ need it, I do not believe these functions are
      called for.
      
      And if it turns out that alpha really _does_ need a barrier for this
      case, that barrier still should not be "smp_read_barrier_depends()".
      We'd have to make up some new speciality barrier just for alpha, along
      with the documentation for why it really is necessary.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul E McKenney <paulmck@us.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d63a9788
      Linus Torvalds authored
      Pull locking changes from Ingo Molnar:
       "The main changes in this cycle were:
      
         - More gradual enhancements to atomic ops: new atomic*_read_ctrl()
           ops, synchronize atomic_{read,set}() ordering requirements between
           architectures, add atomic_long_t bitops.  (Peter Zijlstra)
      
         - Add _{relaxed|acquire|release}() variants for inc/dec atomics and
           use them in various locking primitives: mutex, rtmutex, mcs, rwsem.
           This enables weakly ordered architectures (such as arm64) to make
           use of more locking related optimizations.  (Davidlohr Bueso)
      
         - Implement atomic[64]_{inc,dec}_relaxed() on ARM.  (Will Deacon)
      
         - Futex kernel data cache footprint micro-optimization.  (Rasmus
           Villemoes)
      
         - pvqspinlock runtime overhead micro-optimization.  (Waiman Long)
      
         - misc smaller fixlets"
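To illustrate why acquire/release variants matter for locking primitives on weakly ordered architectures, here is a minimal userspace sketch using C11 atomics. It is an analogy under stated assumptions, not the kernel's mutex, rwsem or MCS code: `toy_lock`, `toy_trylock()`, `toy_lock_acquire()` and `toy_unlock()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set lock for illustration only. */
typedef struct {
    atomic_flag locked;
} toy_lock;

#define TOY_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Taking the lock only needs acquire ordering: accesses inside the
 * critical section must not move up before the lock is taken. */
bool toy_trylock(toy_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Spin until the lock is obtained. */
void toy_lock_acquire(toy_lock *l)
{
    while (!toy_trylock(l))
        ;
}

/* Dropping the lock only needs release ordering: accesses inside the
 * critical section must not move down past the unlock. */
void toy_unlock(toy_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Single-threaded walk-through of lock, failed trylock, unlock. */
void demo_toy_lock(void)
{
    toy_lock l = TOY_LOCK_INIT;

    toy_lock_acquire(&l);
    assert(!toy_trylock(&l)); /* already held */
    toy_unlock(&l);
    assert(toy_trylock(&l));  /* free again */
    toy_unlock(&l);
}
```

Using one-way acquire and release ordering instead of a full barrier on each operation is what leaves room for the architecture-specific optimizations mentioned above.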
      
      * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        ARM, locking/atomics: Implement _relaxed variants of atomic[64]_{inc,dec}
        locking/rwsem: Use acquire/release semantics
        locking/mcs: Use acquire/release semantics
        locking/rtmutex: Use acquire/release semantics
        locking/mutex: Use acquire/release semantics
        locking/asm-generic: Add _{relaxed|acquire|release}() variants for inc/dec atomics
        atomic: Implement atomic_read_ctrl()
        atomic, arch: Audit atomic_{read,set}()
        atomic: Add atomic_long_t bitops
        futex: Force hot variables into a single cache line
        locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL
        locking/osq: Relax atomic semantics
        locking/qrwlock: Rename ->lock to ->wait_lock
        locking/Documentation/lockstat: Fix typo - lokcing -> locking
        locking/atomics, cmpxchg: Privatize the inclusion of asm/cmpxchg.h
  2. 03 Nov, 2015 37 commits