1. 10 Apr, 2016 1 commit
    • intel_pstate: Avoid getting stuck in high P-states when idle · ffb81056
      Rafael J. Wysocki authored
      Jörg Otte reports that commit a4675fbc (cpufreq: intel_pstate:
      Replace timers with utilization update callbacks) caused the CPUs in
      his Haswell-based system to stay in the very high frequency region
      even if the system is completely idle.
      
      That turns out to be an existing problem in the intel_pstate driver's
      P-state selection algorithm for Core processors.  Namely, all
      decisions made by that algorithm are based on the average frequency
      of the CPU between sampling events and on the P-state requested on
      the last invocation, so it may get stuck at a very high frequency
      even if the utilization of the CPU is very low (in fact, it may get
      stuck in an inadequate P-state regardless of the CPU utilization).
      The only way to kick it out of that limbo is a sufficiently long idle
      period (3 times longer than the prescribed sampling interval), but if
      that doesn't happen often enough (e.g. due to a timing change like
      after the above commit), the P-state of the CPU may be inadequate
      pretty much all the time.
      
      To address the most egregious manifestations of that issue, reset the
      core_busy value used to determine the next P-state to request if the
      utilization of the CPU, determined with the help of the MPERF
      feedback register and the TSC, is below 1%.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=115771
      Reported-and-tested-by: Jörg Otte <jrg.otte@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  2. 05 Apr, 2016 3 commits
  3. 01 Apr, 2016 1 commit
    • intel_pstate: Avoid extra invocation of intel_pstate_sample() · febce40f
      Rafael J. Wysocki authored
      The initialization of intel_pstate for a given CPU involves populating
      the fields of its struct cpudata that represent the previous sample,
      but currently that is done in a problematic way.
      
      Namely, intel_pstate_init_cpu() makes an extra call to
      intel_pstate_sample() so it reads the current register values that
      will be used to populate the "previous sample" record during the
      next invocation of intel_pstate_sample().  However, after commit
      a4675fbc (cpufreq: intel_pstate: Replace timers with utilization
      update callbacks) that doesn't work for last_sample_time, because
      the time value is passed to intel_pstate_sample() as an argument now.
      Passing 0 to it from intel_pstate_init_cpu() is problematic, because
      that causes cpu->last_sample_time == 0 to be visible in
      get_target_pstate_use_performance() (and hence the extra
      cpu->last_sample_time > 0 check in there) and effectively allows
      the first invocation of intel_pstate_sample() from
      intel_pstate_update_util() to happen immediately after the
      initialization which may lead to a significant "turn on"
      effect in the governor algorithm.
      
      To mitigate that issue, rework the initialization to avoid the
      extra intel_pstate_sample() call from intel_pstate_init_cpu().
      Instead, make intel_pstate_sample() return false if it has been
      called with cpu->sample.time equal to zero, which will make
      intel_pstate_update_util() skip the sample in that case, and
      reset cpu->sample.time from intel_pstate_set_update_util_hook()
      to make the algorithm start properly every time the hook is set.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  4. 31 Mar, 2016 1 commit
    • intel_pstate: Do not set utilization update hook too early · bb6ab52f
      Rafael J. Wysocki authored
      The utilization update hook in the intel_pstate driver is set too
      early, as it should only be set after the policy has been fully
      initialized by the core.  That may cause intel_pstate_update_util()
      to use incorrect data and put the CPUs into incorrect P-states as
      a result.
      
      To prevent that from happening, make intel_pstate_set_policy() set
      the utilization update hook instead of intel_pstate_init_cpu() so
      intel_pstate_update_util() only runs when all things have been
      initialized as appropriate.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  5. 22 Mar, 2016 7 commits
  6. 19 Mar, 2016 2 commits
    • cpufreq: acpi-cpufreq: Clean up hot plug notifier callback · ed72662a
      Richard Cochran authored
      This driver has two issues.  First, it tries to fiddle with the hot
      plugged CPU's MSR on the UP_PREPARE event, at a time when the CPU is
      not yet online.  Second, the driver sets the "boost-disable" bit for a
      CPU when going down, but does not clear the bit again if the CPU comes
      up again due to DOWN_FAILED.
      
      This patch fixes the issues by changing the driver to react to the
      ONLINE/DOWN_FAILED events instead of UP_PREPARE.  As an added benefit,
      the driver also becomes symmetric with respect to the hot plug
      mechanism.
      Signed-off-by: Richard Cochran <rcochran@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts · fdfdb2b1
      Rafael J. Wysocki authored
      After commit a4675fbc (cpufreq: intel_pstate: Replace timers with
      utilization update callbacks) wrmsrl_on_cpu() cannot be called in the
      intel_pstate_adjust_busy_pstate() path as that is executed with
      disabled interrupts.  However, atom_set_pstate() called from there
      via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the
      IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in
      smp_call_function_single().
      
      atom_set_pstate() uses wrmsrl_on_cpu() because
      intel_pstate_set_pstate(), which calls it, is also invoked during
      the initialization and cleanup of the driver, and in those cases
      it is not guaranteed to run on the CPU that is being updated.
      However, when intel_pstate_set_pstate() is called by
      intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update
      the register safely.  Moreover, intel_pstate_set_pstate() already
      contains code that is only executed if the function is called by
      intel_pstate_adjust_busy_pstate(), and a special argument is
      passed to it because of that.
      
      To fix the problem at hand, rearrange the code taking the above
      observations into account.
      
      First, replace the ->set() callback in struct pstate_funcs with a
      ->get_val() one that will return the value to be written to the
      IA32_PERF_CTL MSR without updating the register.
      
      Second, split intel_pstate_set_pstate() into two functions,
      intel_pstate_update_pstate() to be called by
      intel_pstate_adjust_busy_pstate() that will contain all of the
      intel_pstate_set_pstate() code which only needs to be executed in
      that case and will use wrmsrl() to update the MSR (after obtaining
      the value to write to it from the ->get_val() callback), and
      intel_pstate_set_min_pstate() to be invoked during the
      initialization and cleanup that will set the P-state to the
      minimum one and will update the MSR using wrmsrl_on_cpu().
      
      Finally, move the code shared between intel_pstate_update_pstate()
      and intel_pstate_set_min_pstate() to a new static inline function
      intel_pstate_record_pstate() and make them both call it.
      
      Of course, that unifies the handling of the IA32_PERF_CTL MSR writes
      between Atom and Core.
      
      Fixes: a4675fbc (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
      Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  7. 18 Mar, 2016 1 commit
  8. 10 Mar, 2016 7 commits
  9. 09 Mar, 2016 17 commits
    • Revert "cpufreq: postfix policy directory with the first CPU in related_cpus" · edd4a893
      Viresh Kumar authored
      Revert commit 3510fac4 (cpufreq: postfix policy directory with the
      first CPU in related_cpus).
      
      Earlier, policy->kobj was added to the kobject core before the
      ->init() callback was called for the cpufreq drivers, which allowed
      those drivers to add or remove driver-dependent sysfs
      files/directories under the same kobj from their ->init() and
      ->exit() callbacks.
      
      That isn't possible anymore after commit 3510fac4, and there is no
      other clean alternative that drivers can adopt.
      
      It's better to revert the earlier commit to allow cpufreq drivers to
      create/remove sysfs files from their ->init() and ->exit() callbacks.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: Reduce cpufreq_update_util() overhead a bit · 08f511fd
      Rafael J. Wysocki authored
      Use the observation that cpufreq_update_util() is only called
      by the scheduler with rq->lock held, so the callers of
      cpufreq_set_update_util_data() can use synchronize_sched()
      instead of synchronize_rcu() to wait for cpufreq_update_util()
      to complete.  Moreover, if they are updated to do that,
      rcu_read_(un)lock() calls in cpufreq_update_util() might be
      replaced with rcu_read_(un)lock_sched(), respectively, but
      those aren't really necessary, because the scheduler calls
      that function from RCU-sched read-side critical sections
      already.
      
      In addition to that, if cpufreq_set_update_util_data() checks
      the func field in the struct update_util_data before setting
      the per-CPU pointer to it, the data->func check may be dropped
      from cpufreq_update_util() as well.
      
      Make the above changes to reduce the overhead from
      cpufreq_update_util() in the scheduler paths invoking it
      and to make the cleanup after removing its callbacks somewhat
      less heavyweight.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • cpufreq: Select IRQ_WORK if CPU_FREQ_GOV_COMMON is set · e6f03657
      Rafael J. Wysocki authored
      Commit 0eb463be3436 (cpufreq: governor: Replace timers with utilization
      update callbacks) made CPU_FREQ select IRQ_WORK, but that's not
      necessary, as it is sufficient for IRQ_WORK to be selected by
      CPU_FREQ_GOV_COMMON, so modify the cpufreq Kconfig to that effect.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: Remove 'policy->governor_enabled' · 242aa883
      Viresh Kumar authored
      The entire sequence of events (like INIT/START or STOP/EXIT) for which
      cpufreq_governor() is called, is guaranteed to be protected by
      policy->rwsem now.
      
      The additional checks that were added earlier (as we were forced to drop
      policy->rwsem before calling cpufreq_governor() for EXIT event), aren't
      required anymore.
      
      Moreover, they weren't really sufficient: they only took care of the
      START/STOP events, not INIT/EXIT, and they never maintained the
      state machine properly.
      
      Kill the unnecessary checks and policy->governor_enabled field.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: Rename __cpufreq_governor() to cpufreq_governor() · a1317e09
      Viresh Kumar authored
      The __ prefix at the beginning of the routine's name isn't really
      necessary.  Rename it to cpufreq_governor() instead.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: Relocate handle_update() to kill its declaration · 11eb69b9
      Viresh Kumar authored
      handle_update() is declared at the top of the file because its user
      appears before its definition.  Relocate the routine to get rid of
      the forward declaration.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: governor: Drop unnecessary checks from show() and store() · f737236b
      Viresh Kumar authored
      The show() and store() routines in the cpufreq-governor core don't need
      to check if the struct governor_attr they want to use really provides
      the callbacks they need as expected (if that's not the case, it means a
      bug in the code anyway), so change them to avoid doing that.
      
      Also change the error value returned to -EBUSY if the governor is
      being removed and we aren't allowed to store any more changes.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: governor: Fix race in dbs_update_util_handler() · 27de3482
      Rafael J. Wysocki authored
      There is a scenario that may lead to undesired results in
      dbs_update_util_handler().  Namely, if two CPUs sharing a policy
      enter the function at the same time, pass the sample delay check
      and then one of them is stalled until dbs_work_handler() (queued
      up by the other CPU) clears the work counter, it may update the
      work counter and queue up another work item prematurely.
      
      To prevent that from happening, use the observation that the CPU
      queuing up a work item in dbs_update_util_handler() updates the
      last sample time.  This means that if another CPU was stalling after
      passing the sample delay check and now successfully updated the work
      counter as a result of the race described above, it will see the new
      value of the last sample time which is different from what it used in
      the sample delay check before.  If that happens, the sample delay
      check passed previously is not valid any more, so the CPU should not
      continue.
      
      Fixes: f17cbb53783c (cpufreq: governor: Avoid atomic operations in hot paths)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: governor: Make gov_set_update_util() static · 94ab5e03
      Rafael J. Wysocki authored
      The gov_set_update_util() routine is only used internally by the
      common governor code and it doesn't need to be exported, so make
      it static.
      
      No functional changes.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: governor: Narrow down the dbs_data_mutex coverage · 1112e9d8
      Rafael J. Wysocki authored
      Since cpufreq_governor_dbs() is now always called with policy->rwsem
      held, it cannot be executed twice in parallel for the same policy.
      Thus it is not necessary to hold dbs_data_mutex around the invocations
      of cpufreq_governor_start/stop/limits() from it as those functions
      never modify any data that can be shared between different policies.
      
      However, cpufreq_governor_dbs() may be executed twice in parallel
      for different policies using the same gov->gdbs_data object and
      dbs_data_mutex is still necessary to protect that object against
      concurrent updates.
      
      For this reason, narrow down the dbs_data_mutex locking to
      cpufreq_governor_init/exit() where it is needed and rename the
      mutex to gov_dbs_data_mutex to reflect its purpose.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: governor: Make dbs_data_mutex static · e3f5ed93
      Rafael J. Wysocki authored
      That mutex is only used by cpufreq_governor_dbs() and it doesn't
      need to be exported to modules, so make it static and drop the
      export incantation.
      
      No functional changes.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: governor: Relocate definitions of tuners structures · 47ebaac1
      Rafael J. Wysocki authored
      Move the definitions of struct od_dbs_tuners and struct cs_dbs_tuners
      from the common governor header to the ondemand and conservative
      governor code, respectively, as they don't need to be in the common
      header any more.
      
      No functional changes.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: governor: Move per-CPU data to the common code · 8c8f77fd
      Rafael J. Wysocki authored
      After previous changes there is only one piece of code in the
      ondemand governor making references to per-CPU data structures,
      but it can be easily modified to avoid doing that, so modify it
      accordingly and move the definition of per-CPU data used by the
      ondemand and conservative governors to the common code.  Next,
      change that code to access the per-CPU data structures directly
      rather than via a governor callback.
      
      This causes the ->get_cpu_cdbs governor callback to become
      unnecessary, so drop it along with the macro and function
      definitions related to it.
      
      Finally, drop the definitions of struct od_cpu_dbs_info_s and
      struct cs_cpu_dbs_info_s that aren't necessary any more.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: governor: Make governor private data per-policy · 7d5a9956
      Rafael J. Wysocki authored
      Some fields in struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s
      are only used for a limited set of CPUs.  Namely, if a policy is
      shared between multiple CPUs, those fields will only be used for one
      of them (policy->cpu).  This means that they really are per-policy
      rather than per-CPU and holding room for them in per-CPU data
      structures is generally wasteful.  Also moving those fields into
      per-policy data structures will allow some significant simplifications
      to be made going forward.
      
      For this reason, introduce struct cs_policy_dbs_info and
      struct od_policy_dbs_info to hold those fields.  Define each of the
      new structures as an extension of struct policy_dbs_info (such that
      struct policy_dbs_info is embedded in each of them) and introduce
      new ->alloc and ->free governor callbacks to allocate and free
      those structures, respectively, such that ->alloc() will return
      a pointer to the struct policy_dbs_info embedded in the allocated
      data structure and ->free() will take that pointer as its argument.
      
      With that, modify the code accessing the data fields in question
      in per-CPU data objects to look for them in the new structures
      via the struct policy_dbs_info pointer available to it and drop
      them from struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: ondemand: Rework the handling of powersave bias updates · d1db75ff
      Rafael J. Wysocki authored
      The ondemand_powersave_bias_init() function used for resetting data
      fields related to the powersave bias tunable of the ondemand governor
      works by walking all of the online CPUs in the system and updating the
      od_cpu_dbs_info_s structures for all of them.
      
      However, if governor tunables are per policy, the update should not
      touch the CPUs that are not associated with the given dbs_data.
      
      Moreover, since the data fields in question are only ever used for
      policy->cpu in each policy governed by ondemand, the update can be
      limited to those specific CPUs.
      
      Rework the code to take the above observations into account.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: governor: Fix CPU load information updates via ->store · a33cce1c
      Rafael J. Wysocki authored
      The ->store() callbacks of some tunable sysfs attributes of the
      ondemand and conservative governors trigger immediate updates of
      the CPU load information for all CPUs "governed" by the given
      dbs_data by walking the cpu_dbs_info structures for all online
      CPUs in the system and updating them.
      
      This is questionable for two reasons.  First, it may lead to a lot of
      extra overhead on a system with many CPUs if the given dbs_data is
      only associated with a few of them.  Second, if governor tunables are
      per-policy, the CPUs associated with the other sets of governor
      tunables should not be updated.
      
      To address this issue, use the observation that in all of the places
      in question the update operation may be carried out in the same way
      (because all of the tunables involved are now located in struct
      dbs_data and readily available to the common code) and make the
      code in those places invoke the same (new) helper function that
      will carry out the update correctly.
      
      That new function always checks the ignore_nice_load tunable value
      and updates the CPUs' prev_cpu_nice data fields if that's set, which
      wasn't done by the original code in store_io_is_busy(), but it
      should have been done in there too.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    • cpufreq: ondemand: Drop one more callback from struct od_ops · 76c5f66a
      Rafael J. Wysocki authored
      The ->powersave_bias_init_cpu callback in struct od_ops is only used
      in one place and that invocation may be replaced with a direct call
      to the function pointed to by that callback, so change the code
      accordingly and drop the callback.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>