1. 21 Mar, 2016 19 commits
    • Huang Rui's avatar
      perf/x86/amd/power: Add AMD accumulated power reporting mechanism · c7ab62bf
      Huang Rui authored
      Introduce an AMD accumlated power reporting mechanism for the Family
      15h, Model 60h processor that can be used to calculate the average
      power consumed by a processor during a measurement interval. The
      feature support is indicated by CPUID Fn8000_0007_EDX[12].
      
      This feature will be implemented both in hwmon and perf. The current
      design provides one event to report per package/processor power
      consumption by counting each compute unit power value.
      
      Here the gory details of how the computation is done:
      
      * Tsample: compute unit power accumulator sample period
      * Tref: the PTSC counter period (PTSC: performance timestamp counter)
      * N: the ratio of compute unit power accumulator sample period to the
        PTSC period
      
      * Jmax: max compute unit accumulated power which is indicated by
        MSR_C001007b[MaxCpuSwPwrAcc]
      
      * Jx/Jy: compute unit accumulated power which is indicated by
        MSR_C001007a[CpuSwPwrAcc]
      
      * Tx/Ty: the value of performance timestamp counter which is indicated
        by CU_PTSC MSR_C0010280[PTSC]
      * PwrCPUave: CPU average power
      
      i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
      	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].
      
      ii. Read the full range of the cumulative energy value from the new
          MSR MaxCpuSwPwrAcc.
      	Jmax = value returned.
      
      iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
      	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.
      
      iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
      	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.
      
      v. Calculate the average power consumption for a compute unit over
      time period (y-x). Unit of result is uWatt:
      
      	if (Jy < Jx) // Rollover has occurred
      		Jdelta = (Jy + Jmax) - Jx
      	else
      		Jdelta = Jy - Jx
      	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
      
      Simple example:
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
          CHK     include/config/kernel.release
          CHK     include/generated/uapi/linux/version.h
          CHK     include/generated/utsrelease.h
          CHK     include/generated/timeconst.h
          CHK     include/generated/bounds.h
          CHK     include/generated/asm-offsets.h
          CALL    scripts/checksyscalls.sh
          CHK     include/generated/compile.h
          SKIPPED include/generated/compile.h
          Building modules, stage 2.
        Kernel: arch/x86/boot/bzImage is ready  (#40)
          MODPOST 4225 modules
      
         Performance counter stats for 'system wide':
      
                    183.44 mWatts power/power-pkg/
      
             341.837270111 seconds time elapsed
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10
      
         Performance counter stats for 'system wide':
      
                      0.18 mWatts power/power-pkg/
      
              10.012551815 seconds time elapsed
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Suggested-by: default avatarIngo Molnar <mingo@kernel.org>
      Suggested-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarHuang Rui <ray.huang@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jacob.w.shin@gmail.com
      Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
      [ Fixed the modular build. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      c7ab62bf
    • Huang Rui's avatar
      x86/cpufeature, perf/x86: Add AMD Accumulated Power Mechanism feature flag · 01fe03ff
      Huang Rui authored
      AMD CPU family 15h model 0x60 introduces a mechanism for measuring
      accumulated power. It is used to report the processor power consumption
      and support for it is indicated by CPUID Fn8000_0007_EDX[12].
      Signed-off-by: default avatarHuang Rui <ray.huang@amd.com>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wan Zongshun <Vincent.Wan@amd.com>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/1452739808-11871-4-git-send-email-ray.huang@amd.com
      [ Resolved conflict and moved the synthetic CPUID slot to 19. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      01fe03ff
    • Peter Zijlstra's avatar
      perf/core: Document some hotplug bits · 1dcaac1c
      Peter Zijlstra authored
      Document some of the hotplug notifier usage.
      Requested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1dcaac1c
    • Suravee Suthikulpanit's avatar
      perf/x86/amd: Add support for new IOMMU performance events · f8519155
      Suravee Suthikulpanit authored
      This patch adds new IOMMU performance event based on
      the information in table 74 of the AMD I/O Virtualization Technology
      (IOMMU) Specification (Document Id: 4882, Rev 2.62, Feb 2015)
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarJoerg Roedel <jroedel@suse.de>
      Acked-by: default avatarJoerg Roedel <jroedel@suse.de>
      Cc: <acme@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://support.amd.com/TechDocs/48882_IOMMU.pdfSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      f8519155
    • Huang Rui's avatar
      perf/x86/amd: Move nodes_per_socket into bsp_init_amd() · 8dfeae0d
      Huang Rui authored
      nodes_per_socket is static and it needn't be initialized many
      times during every CPU core init. So move its initialization into
      bsp_init_amd().
      Signed-off-by: default avatarHuang Rui <ray.huang@amd.com>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/1452739808-11871-2-git-send-email-ray.huang@amd.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8dfeae0d
    • Peter Zijlstra's avatar
      perf/x86/cqm: Factor out some common code · 27348f38
      Peter Zijlstra authored
      Having the same code twice (and once quite ugly) is fragile.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      27348f38
    • Vikas Shivappa's avatar
      perf/x86/mbm: Add support for MBM counter overflow handling · e7ee3e8c
      Vikas Shivappa authored
      This patch adds a per package timer which periodically updates the
      memory bandwidth counters for the events that are currently active.
      
      Current patch has a periodic timer every 1s since the SDM guarantees
      that the counter will not overflow in 1s but this time can be definitely
      improved by calibrating on the system. The overflow is really a function
      of the max memory b/w that the socket can support, max counter value and
      scaling factor.
      Signed-off-by: default avatarVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/013b756c5006b1c4ca411f3ecf43ed52f19fbf87.1457723885.git.tony.luck@intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e7ee3e8c
    • Vikas Shivappa's avatar
      perf/x86/mbm: Implement RMID recycling · 2d4de837
      Vikas Shivappa authored
      RMID could be allocated or deallocated as part of RMID recycling.
      
      When an RMID is allocated for MBM event, the MBM counter needs to be
      initialized because next time we read the counter we need the previous
      value to account for total bytes that went to the memory controller.
      
      Similarly, when RMID is deallocated we need to update the ->count
      variable.
      Signed-off-by: default avatarVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-6-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2d4de837
    • Tony Luck's avatar
      perf/x86/mbm: Add memory bandwidth monitoring event management · 87f01cc2
      Tony Luck authored
      Includes all the core infrastructure to measure the total_bytes and
      bandwidth.
      
      We have per socket counters for both total system wide L3 external
      bytes and local socket memory-controller bytes. The OS does MSR writes
      to MSR_IA32_QM_EVTSEL and MSR_IA32_QM_CTR to read the counters and
      uses the IA32_PQR_ASSOC_MSR to associate the RMID with the task. The
      tasks have a common RMID for CQM (cache quality of service monitoring)
      and MBM. Hence most of the scheduling code is reused from CQM.
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      [ Restructured rmid_read to not have an obvious hole, removed MBM_CNTR_MAX as its unused. ]
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarVikas Shivappa <vikas.shivappa@linux.intel.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/abd7aac9a18d93b95b985b931cf258df0164746d.1457723885.git.tony.luck@intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      87f01cc2
    • Vikas Shivappa's avatar
      perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init · 33c3cc7a
      Vikas Shivappa authored
      The MBM init patch enumerates the Intel MBM (Memory b/w monitoring)
      and initializes the perf events and datastructures for monitoring the
      memory b/w.
      
      Its based on original patch series by Tony Luck and Kanaka Juvva.
      
      Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor
      bandwidth from one level of cache to another. The current patches
      support L3 external bandwidth monitoring. It supports both 'local
      bandwidth' and 'total bandwidth' monitoring for the socket. Local
      bandwidth measures the amount of data sent through the memory controller
      on the socket and total b/w measures the total system bandwidth.
      
      Extending the cache quality of service monitoring (CQM) we add two
      more events to the perf infrastructure:
      
        intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
        intel_cqm_llc/total_bytes - total L3 external bytes sent
      
      The tasks are associated with a Resouce Monitoring ID (RMID) just like
      in CQM and OS uses a MSR write to indicate the RMID of the task during
      scheduling.
      Signed-off-by: default avatarVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-4-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      33c3cc7a
    • Vikas Shivappa's avatar
      perf/x86/cqm: Fix CQM memory leak and notifier leak · ada2f634
      Vikas Shivappa authored
      Fixes the hotcpu notifier leak and other global variable memory leaks
      during CQM (cache quality of service monitoring) initialization.
      Signed-off-by: default avatarVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-3-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ada2f634
    • Vikas Shivappa's avatar
      perf/x86/cqm: Fix CQM handling of grouping events into a cache_group · a223c1c7
      Vikas Shivappa authored
      Currently CQM (cache quality of service monitoring) is grouping all
      events belonging to same PID to use one RMID. However its not counting
      all of these different events. Hence we end up with a count of zero
      for all events other than the group leader.
      
      The patch tries to address the issue by keeping a flag in the
      perf_event.hw which has other CQM related fields. The field is updated
      at event creation and during grouping.
      Signed-off-by: default avatarVikas Shivappa <vikas.shivappa@linux.intel.com>
      [peterz: Changed hw_perf_event::is_group_event to an int]
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-2-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a223c1c7
    • Peter Zijlstra's avatar
      perf/core: Fix Undefined behaviour in rb_alloc() · 8184059e
      Peter Zijlstra authored
      Sasha reported:
      
       [ 3494.030114] UBSAN: Undefined behaviour in kernel/events/ring_buffer.c:685:22
       [ 3494.030647] shift exponent -1 is negative
      
      Andrey spotted that this is because:
      
        It happens if nr_pages = 0:
           rb->page_order = ilog2(nr_pages);
      
      Fix it by making both assignments conditional on nr_pages; since
      otherwise they should both be 0 anyway, and will be because of the
      kzalloc() used to allocate the structure.
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Reported-by: default avatarAndrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20160129141751.GA407@worktopSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8184059e
    • Peter Zijlstra's avatar
      perf/x86/BTS: Fix RCU usage · e8d8a90f
      Peter Zijlstra authored
      This splat reminds us:
      
      [ 8166.045595] [ INFO: suspicious RCU usage. ]
      
      [ 8166.168972]  [<ffffffff81127837>] lockdep_rcu_suspicious+0xe7/0x120
      [ 8166.175966]  [<ffffffff811e0bae>] perf_callchain+0x23e/0x250
      [ 8166.182280]  [<ffffffff811dda3d>] perf_prepare_sample+0x27d/0x350
      [ 8166.189082]  [<ffffffff8100f503>] intel_pmu_drain_bts_buffer+0x133/0x200
      
      ... that as the core code does, one should hold rcu_read_lock() over that
      entire BTS event-output generation sequence as well.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e8d8a90f
    • Peter Zijlstra's avatar
      perf/core: Fix dynamic interrupt throttle · 91a612ee
      Peter Zijlstra authored
      There were two problems with the dynamic interrupt throttle mechanism,
      both triggered by the same action.
      
      When you (or perf_fuzzer) write a huge value into
      /proc/sys/kernel/perf_event_max_sample_rate the computed
      perf_sample_allowed_ns becomes 0. This effectively disables the whole
      dynamic throttle.
      
      This is fixed by ensuring update_perf_cpu_limits() never sets the
      value to 0. However, we allow disabling of the dynamic throttle by
      writing 100 to /proc/sys/kernel/perf_cpu_time_max_percent. This will
      generate a warning in dmesg.
      
      The second problem is that by setting the max_sample_rate to a huge
      number, the adaptive process can take a few tries, since it halfs the
      limit each time. Change that to directly compute a new value based on
      the observed duration.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      91a612ee
    • Peter Zijlstra's avatar
      perf/x86/ibs: Add IBS interrupt to the dynamic throttle · c2872d38
      Peter Zijlstra authored
      Interrupt throttling is normally only done against
      sysctl_perf_event_sample_rate. This means that if that number is too
      high (for whatever reason) you can lock up your machine.
      
      We have, however, a dynamic throttling scheme too, but for that to
      work, we need to add a callback to the interrupt handler, IBS did not
      have this, so add it.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      c2872d38
    • Peter Zijlstra's avatar
      perf/x86/ibs: Fix race with IBS_STARTING state · 5a50f529
      Peter Zijlstra authored
      While tracing the IBS bits I saw the NMI hitting between clearing
      IBS_STARTING and the actual MSR writes to disable the counter.
      
      Since IBS_STARTING was cleared, the handler assumed these were spurious
      NMIs and because STOPPING wasn't set yet either, insta-triggered an
      "Unknown NMI".
      
      Cure this by clearing IBS_STARTING after disabling the hardware.
      Tested-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5a50f529
    • Peter Zijlstra's avatar
      perf/x86/ibs: Fix IBS throttle · 0158b83f
      Peter Zijlstra authored
      When the IBS IRQ handler get a !0 return from perf_event_overflow;
      meaning it should throttle the event, it only disables it, it doesn't
      call perf_ibs_stop().
      
      This confuses the state machine, as we'll use pmu::start() ->
      perf_ibs_start() to unthrottle.
      Tested-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: dvyukov@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Link: http://lkml.kernel.org/r/20160311142346.GE6344@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0158b83f
    • Peter Zijlstra's avatar
      perf/core: Fix the unthrottle logic · 1e02cd40
      Peter Zijlstra authored
      Its possible to IOC_PERIOD while the event is throttled, this would
      re-start the event and the next tick would then try to unthrottle it,
      and find the event wasn't actually stopped anymore.
      
      This would tickle a WARN in the x86-pmu code which isn't expecting to
      start a !stopped event.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: dvyukov@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160310143924.GR6356@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1e02cd40
  2. 19 Mar, 2016 7 commits
  3. 18 Mar, 2016 3 commits
    • Thomas Gleixner's avatar
      x86/irq: Cure live lock in fixup_irqs() · 551adc60
      Thomas Gleixner authored
      Harry reported, that he's able to trigger a system freeze with cpu hot
      unplug. The freeze turned out to be a live lock caused by recent changes in
      irq_force_complete_move().
      
      When fixup_irqs() and from there irq_force_complete_move() is called on the
      dying cpu, then all other cpus are in stop machine an wait for the dying cpu
      to complete the teardown. If there is a move of an interrupt pending then
      irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
      mask and waits for them to clear the mask. That's obviously impossible as
      those cpus are firmly stuck in stop machine with interrupts disabled.
      
      I should have known that, but I completely overlooked it being concentrated on
      the locking issues around the vectors. And the existance of the call to
      __irq_complete_move() in the code, which actually sends the cleanup IPI made
      it reasonable to wait for that cleanup to complete. That call was bogus even
      before the recent changes as it was just a pointless distraction.
      
      We have to look at two cases:
      
      1) The move_in_progress flag of the interrupt is set
      
         This means the ioapic has been updated with the new vector, but it has not
         fired yet. In theory there is a race:
      
         set_ioapic(new_vector) <-- Interrupt is raised before update is effective,
         			      i.e. it's raised on the old vector. 
      
         So if the target cpu cannot handle that interrupt before the old vector is
         cleaned up, we get a spurious interrupt and in the worst case the ioapic
         irq line becomes stale, but my experiments so far have only resulted in
         spurious interrupts.
      
         But in case of cpu hotplug this should be a non issue because if the
         affinity update happens right before all cpus rendevouz in stop machine,
         there is no way that the interrupt can be blocked on the target cpu because
         all cpus loops first with interrupts enabled in stop machine, so the old
         vector is not yet cleaned up when the interrupt fires.
      
         So the only way to run into this issue is if the delivery of the interrupt
         on the apic/system bus would be delayed beyond the point where the target
         cpu disables interrupts in stop machine. I doubt that it can happen, but at
         least there is a theroretical chance. Virtualization might be able to
         expose this, but AFAICT the IOAPIC emulation is not as stupid as the real
         hardware.
      
         I've spent quite some time over the weekend to enforce that situation,
         though I was not able to trigger the delayed case.
      
      2) The move_in_progress flag is not set and the old_domain cpu mask is not
         empty.
      
         That means, that an interrupt was delivered after the change and the
         cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
         responded to it yet.
      
      In both cases we can assume that the next interrupt will arrive on the new
      vector, so we can cleanup the old vectors on the cpus in the old_domain cpu
      mask.
      
      Fixes: 98229aa3 "x86/irq: Plug vector cleanup race"
      Reported-by: default avatarHarry Junior <harryjr@outlook.fr>
      Tested-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Joe Lawrence <joe.lawrence@stratus.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603140931430.3657@nanosSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      551adc60
    • Thomas Gleixner's avatar
      x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known() · f508a5ba
      Thomas Gleixner authored
      The topology_core_cpumask is used to find a neighbour cpu in
      calibrate_delay_is_known(). It might not be allocated at the first invocation
      of that function on the boot cpu, when CONFIG_CPUMASK_OFFSTACK is set.
      
      The mask is allocated later in native_smp_prepare_cpus. As a consequence the
      underlying find_next_bit() call dereferences a NULL pointer.
      
      Add a proper check to prevent this.
      
      Fixes: c25323c0 "x86/tsc: Use topology functions"
      Reported-and-tested-by: default avatarRichard W.M. Jones <rjones@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603180843270.3978@nanosSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      f508a5ba
    • Dave Jones's avatar
      x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt() · 7834c103
      Dave Jones authored
      Since 4.4, I've been able to trigger this occasionally:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.5.0-rc7-think+ #3 Not tainted
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.ukSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      
      -------------------------------
      ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from idle CPU!
      rcu_scheduler_active = 1, debug_locks = 1
      RCU used illegally from extended quiescent state!
      no locks held by swapper/3/0.
      
      stack backtrace:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3
       ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a
       ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6
       ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60
      Call Trace:
       <IRQ>  [<ffffffff92560c2a>] dump_stack+0x67/0x9d
       [<ffffffff921376a6>] lockdep_rcu_suspicious+0xe6/0x100
       [<ffffffff925ae7a7>] do_trace_write_msr+0x127/0x1a0
       [<ffffffff92061c83>] native_apic_msr_eoi_write+0x23/0x30
       [<ffffffff92054408>] smp_trace_call_function_interrupt+0x38/0x360
       [<ffffffff92d1ca60>] trace_call_function_interrupt+0x90/0xa0
       <EOI>  [<ffffffff92ac5124>] ? cpuidle_enter_state+0x1b4/0x520
      
      Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
      tells the RCU susbstems to end the extended quiescent state, so that the
      following trace call in ack_APIC_irq() works correctly.
      Suggested-by: default avatarAndi Kleen <ak@linux.intel.com>
      Fixes: 4787c368 "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()"
      Signed-off-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      7834c103
  4. 17 Mar, 2016 4 commits
  5. 16 Mar, 2016 2 commits
  6. 15 Mar, 2016 5 commits
    • Linus Torvalds's avatar
      Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 710d60cb
      Linus Torvalds authored
      Pull cpu hotplug updates from Thomas Gleixner:
       "This is the first part of the ongoing cpu hotplug rework:
      
         - Initial implementation of the state machine
      
         - Runs all online and prepare down callbacks on the plugged cpu and
           not on some random processor
      
         - Replaces busy loop waiting with completions
      
         - Adds tracepoints so the states can be followed"
      
      More detailed commentary on this work from an earlier email:
       "What's wrong with the current cpu hotplug infrastructure?
      
         - Asymmetry
      
           The hotplug notifier mechanism is asymmetric versus the bringup and
           teardown.  This is mostly caused by the notifier mechanism.
      
         - Largely undocumented dependencies
      
           While some notifiers use explicitely defined notifier priorities,
           we have quite some notifiers which use numerical priorities to
           express dependencies without any documentation why.
      
         - Control processor driven
      
           Most of the bringup/teardown of a cpu is driven by a control
           processor.  While it is understandable, that preperatory steps,
           like idle thread creation, memory allocation for and initialization
           of essential facilities needs to be done before a cpu can boot,
           there is no reason why everything else must run on a control
           processor.  Before this patch series, bringup looks like this:
      
             Control CPU                     Booting CPU
      
             do preparatory steps
             kick cpu into life
      
                                             do low level init
      
             sync with booting cpu           sync with control cpu
      
             bring the rest up
      
         - All or nothing approach
      
           There is no way to do partial bringups.  That's something which is
           really desired because we waste e.g.  at boot substantial amount of
           time just busy waiting that the cpu comes to life.  That's stupid
           as we could very well do preparatory steps and the initial IPI for
           other cpus and then go back and do the necessary low level
           synchronization with the freshly booted cpu.
      
         - Minimal debuggability
      
           Due to the notifier based design, it's impossible to switch between
           two stages of the bringup/teardown back and forth in order to test
           the correctness.  So in many hotplug notifiers the cancel
           mechanisms are either not existant or completely untested.
      
         - Notifier [un]registering is tedious
      
           To [un]register notifiers we need to protect against hotplug at
           every callsite.  There is no mechanism that bringup/teardown
           callbacks are issued on the online cpus, so every caller needs to
           do it itself.  That also includes error rollback.
      
        What's the new design?
      
           The base of the new design is a symmetric state machine, where both
           the control processor and the booting/dying cpu execute a well
           defined set of states.  Each state is symmetric in the end, except
           for some well defined exceptions, and the bringup/teardown can be
           stopped and reversed at almost all states.
      
           So the bringup of a cpu will look like this in the future:
      
             Control CPU                     Booting CPU
      
             do preparatory steps
             kick cpu into life
      
                                             do low level init
      
             sync with booting cpu           sync with control cpu
      
                                             bring itself up
      
           The synchronization step does not require the control cpu to wait.
           That mechanism can be done asynchronously via a worker or some
           other mechanism.
      
           The teardown can be made very similar, so that the dying cpu cleans
           up and brings itself down.  Cleanups which need to be done after
           the cpu is gone, can be scheduled asynchronously as well.
      
        There is a long way to this, as we need to refactor the notion when a
        cpu is available.  Today we set the cpu online right after it comes
        out of the low level bringup, which is not really correct.
      
        The proper mechanism is to set it to available, i.e. cpu local
        threads, like softirqd, hotplug thread etc. can be scheduled on that
        cpu, and once it finished all booting steps, it's set to online, so
        general workloads can be scheduled on it.  The reverse happens on
        teardown.  First thing to do is to forbid scheduling of general
        workloads, then teardown all the per cpu resources and finally shut it
        off completely.
      
        This patch series implements the basic infrastructure for this at the
        core level.  This includes the following:
      
         - Basic state machine implementation with well defined states, so
           ordering and prioritization can be expressed.
      
         - Interfaces to [un]register state callbacks
      
           This invokes the bringup/teardown callback on all online cpus with
           the proper protection in place and [un]installs the callbacks in
           the state machine array.
      
           For callbacks which have no particular ordering requirement we have
           a dynamic state space, so that drivers don't have to register an
           explicit hotplug state.
      
           If a callback fails, the code automatically does a rollback to the
           previous state.
      
         - Sysfs interface to drive the state machine to a particular step.
      
           This is only partially functional today.  Full functionality and
           therefor testability will be achieved once we converted all
           existing hotplug notifiers over to the new scheme.
      
         - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
           processor:
      
             Control CPU                     Booting CPU
      
             do preparatory steps
             kick cpu into life
      
                                             do low level init
      
             sync with booting cpu           sync with control cpu
             wait for boot
                                             bring itself up
      
                                             Signal completion to control cpu
      
           In a previous step of this work we've done a full tree mechanical
           conversion of all hotplug notifiers to the new scheme.  The balance
           is a net removal of about 4000 lines of code.
      
           This is not included in this series, as we decided to take a
           different approach.  Instead of mechanically converting everything
           over, we will do a proper overhaul of the usage sites one by one so
           they nicely fit into the symmetric callback scheme.
      
           I decided to do that after I looked at the ugliness of some of the
           converted sites and figured out that their hotplug mechanism is
           completely buggered anyway.  So there is no point to do a
           mechanical conversion first as we need to go through the usage
           sites one by one again in order to achieve a full symmetric and
           testable behaviour"
      
      * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
        cpu/hotplug: Document states better
        cpu/hotplug: Fix smpboot thread ordering
        cpu/hotplug: Remove redundant state check
        cpu/hotplug: Plug death reporting race
        rcu: Make CPU_DYING_IDLE an explicit call
        cpu/hotplug: Make wait for dead cpu completion based
        cpu/hotplug: Let upcoming cpu bring itself fully up
        arch/hotplug: Call into idle with a proper state
        cpu/hotplug: Move online calls to hotplugged cpu
        cpu/hotplug: Create hotplug threads
        cpu/hotplug: Split out the state walk into functions
        cpu/hotplug: Unpark smpboot threads from the state machine
        cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
        cpu/hotplug: Implement setup/removal interface
        cpu/hotplug: Make target state writeable
        cpu/hotplug: Add sysfs state interface
        cpu/hotplug: Hand in target state to _cpu_up/down
        cpu/hotplug: Convert the hotplugged cpu work to a state machine
        cpu/hotplug: Convert to a state machine for the control processor
        cpu/hotplug: Add tracepoints
        ...
      710d60cb
    • Linus Torvalds's avatar
      Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · df2e37c8
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "The 4.6 pile of irq updates contains:
      
         - Support for IPI irqdomains to support proper integration of IPIs to
           and from coprocessors.  The first user of this new facility is
           MIPS.  The relevant MIPS patches come with the core to avoid merge
           ordering issues and have been acked by Ralf.
      
         - A new command line option to set the default interrupt affinity
           mask at boot time.
      
         - Support for some more new ARM and MIPS interrupt controllers:
           tango, alpine-msix and bcm6345-l1
      
         - Two small cleanups for x86/apic which we merged into irq/core to
           avoid yet another branch in x86 with two tiny commits.
      
         - The usual set of updates, cleanups in drivers/irqchip.  Mostly in
           the area of ARM-GIC, arada-37-xp and atmel chips.  Nothing
           outstanding here"
      
      * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
        irqchip/irq-alpine-msi: Release the correct domain on error
        irqchip/mxs: Fix error check of of_io_request_and_map()
        irqchip/sunxi-nmi: Fix error check of of_io_request_and_map()
        genirq: Export IRQ functions for module use
        irqchip/gic/realview: Support more RealView DCC variants
        Documentation/bindings: Document the Alpine MSIX driver
        irqchip: Add the Alpine MSIX interrupt controller
        irqchip/gic-v3: Always return IRQ_SET_MASK_OK_DONE in gic_set_affinity
        irqchip/gic-v3-its: Mark its_init() and its children as __init
        irqchip/gic-v3: Remove gic_root_node variable from the ITS code
        irqchip/gic-v3: ACPI: Add redistributor support via GICC structures
        irqchip/gic-v3: Add ACPI support for GICv3/4 initialization
        irqchip/gic-v3: Refactor gic_of_init() for GICv3 driver
        x86/apic: Deinline _flat_send_IPI_mask, save ~150 bytes
        x86/apic: Deinline __default_send_IPI_*, save ~200 bytes
        dt-bindings: interrupt-controller: Add SoC-specific compatible string to Marvell ODMI
        irqchip/mips-gic: Add new DT property to reserve IPIs
        MIPS: Delete smp-gic.c
        MIPS: Make smp CMP, CPS and MT use the new generic IPI functions
        MIPS: Add generic SMP IPI support
        ...
      df2e37c8
    • Linus Torvalds's avatar
      Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8a284c06
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "The timer department delivers this time:
      
         - Support for cross clock domain timestamps in the core code plus a
           first user.  That allows more precise timestamping for PTP and
           later for audio and other peripherals.
      
           The ptp/e1000e patches have been acked by the relevant maintainers
           and are carried in the timer tree to avoid merge ordering issues.
      
         - Support for unregistering the current clocksource watchdog.  That
           lifts a limitation for switching clocksources which has been there
           from day 1
      
         - The usual pile of fixes and updates to the core and the drivers.
           Nothing outstanding and exciting"
      
      * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
        time/timekeeping: Work around false positive GCC warning
        e1000e: Adds hardware supported cross timestamp on e1000e nic
        ptp: Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping
        x86/tsc: Always Running Timer (ART) correlated clocksource
        hrtimer: Revert CLOCK_MONOTONIC_RAW support
        time: Add history to cross timestamp interface supporting slower devices
        time: Add driver cross timestamp interface for higher precision time synchronization
        time: Remove duplicated code in ktime_get_raw_and_real()
        time: Add timekeeping snapshot code capturing system time and counter
        time: Add cycles to nanoseconds translation
        jiffies: Use CLOCKSOURCE_MASK instead of constant
        clocksource: Introduce clocksource_freq2mult()
        clockevents/drivers/exynos_mct: Implement ->set_state_oneshot_stopped()
        clockevents/drivers/arm_global_timer: Implement ->set_state_oneshot_stopped()
        clockevents/drivers/arm_arch_timer: Implement ->set_state_oneshot_stopped()
        clocksource/drivers/arm_global_timer: Register delay timer
        clocksource/drivers/lpc32xx: Support timer-based ARM delay
        clocksource/drivers/lpc32xx: Support periodic mode
        clocksource/drivers/lpc32xx: Don't use the prescaler counter for clockevents
        clocksource/drivers/rockchip: Add err handle for rk_timer_init
        ...
      8a284c06
    • Linus Torvalds's avatar
      Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 208de214
      Linus Torvalds authored
      Pull RCU updates from Ingo Molnar:
       "The main changes in this cycle were:
      
         - Miscellaneous fixes, cleanups, restructuring.
      
         - RCU torture-test updates"
      
      * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        rcu: Export rcu_gp_is_normal()
        rcu: Remove rcu_user_hooks_switch
        rcu: Catch up rcu_report_qs_rdp() comment with reality
        rcu: Document unique-name limitation for DEFINE_STATIC_SRCU()
        rcu: Make rcu/tiny_plugin.h explicitly non-modular
        irq: Privatize irq_common_data::state_use_accessors
        RCU: Privatize rcu_node::lock
        sparse: Add __private to privatize members of structs
        rcu: Remove useless rcu_data_p when !PREEMPT_RCU
        rcutorture: Correct no-expedite console messages
        rcu: Set rdp->gpwrap when CPU is idle
        rcu: Stop treating in-kernel CPU-bound workloads as errors
        rcu: Update rcu_report_qs_rsp() comment
        rcu: Assign false instead of 0 for ->core_needs_qs
        rcutorture: Check for self-detected stalls
        rcutorture: Don't keep empty console.log.diags files
        rcutorture: Add checks for rcutorture writer starvation
      208de214
    • Linus Torvalds's avatar
      Merge branch 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ae465bee
      Linus Torvalds authored
      Pull x86 timer update from Ingo Molnar:
       "A single simplification of the x86 TSC code"
      
      * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/tsc: Use topology functions
      ae465bee