Commit 3563f55c authored by Linus Torvalds

Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add hybrid processors support to the intel_pstate driver and
  make it work with more processor models when HWP is disabled, make the
   intel_idle driver use special C6 idle state parameters when package
  C-states are disabled, add cooling support to the tegra30 devfreq
  driver, rework the TEO (timer events oriented) cpuidle governor,
  extend the OPP (operating performance points) framework to use the
  required-opps DT property in more cases, fix some issues and clean up
  a number of assorted pieces of code.

  Specifics:

   - Make intel_pstate support hybrid processors using abstract
     performance units in the HWP interface (Rafael Wysocki).

   - Add Icelake servers and Cometlake support in no-HWP mode to
     intel_pstate (Giovanni Gherdovich).

   - Make cpufreq_online() error path be consistent with the CPU device
     removal path in cpufreq (Rafael Wysocki).

   - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
     Randy Dunlap, Shaokun Zhang).

   - Make intel_idle use special idle state parameters for C6 when
     package C-states are disabled (Chen Yu).

   - Rework the TEO (timer events oriented) cpuidle governor to address
     some theoretical shortcomings in it (Rafael Wysocki).

   - Drop unneeded semicolon from the TEO governor (Wan Jiabing).

   - Modify the runtime PM framework to accept unassigned suspend and
     resume callback pointers (Ulf Hansson).

   - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).

   - Improve device performance states support in the generic power
     domains (genpd) framework (Ulf Hansson).

   - Fix some documentation issues in genpd (Yang Yingliang).

   - Make the operating performance points (OPP) framework use the
     required-opps DT property in use cases that are not related to
     genpd (Hsin-Yi Wang).

   - Make lazy_link_required_opp_table() use list_del_init instead of
     list_del/INIT_LIST_HEAD (Yang Yingliang).

   - Simplify wake IRQs handling in the core system-wide sleep support
     code and clean up some coding style inconsistencies in it (Tian
     Tao, Zhen Lei).

   - Add cooling support to the tegra30 devfreq driver and improve its
     DT bindings (Dmitry Osipenko).

   - Fix some assorted issues in the devfreq core and drivers (Chanwoo
     Choi, Dong Aisheng, YueHaibing)"

* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
  PM / devfreq: passive: Fix get_target_freq when not using required-opp
  cpufreq: Make cpufreq_online() call driver->offline() on errors
  opp: Allow required-opps to be used for non genpd use cases
  cpuidle: teo: remove unneeded semicolon in teo_select()
  dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
  dt-bindings: devfreq: tegra30-actmon: Convert to schema
  PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
  PM: runtime: Clarify documentation when callbacks are unassigned
  PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
  PM: runtime: Improve path in rpm_idle() when no callback
  PM: hibernate: remove leading spaces before tabs
  PM: sleep: remove trailing spaces and tabs
  PM: domains: Drop/restore performance state votes for devices at runtime PM
  PM: domains: Return early if perf state is already set for the device
  PM: domains: Split code in dev_pm_genpd_set_performance_state()
  cpuidle: teo: Use kerneldoc documentation in admin-guide
  cpuidle: teo: Rework most recent idle duration values treatment
  cpuidle: teo: Change the main idle state selection logic
  cpuidle: teo: Cosmetic modification of teo_select()
  cpuidle: teo: Cosmetic modifications of teo_update()
  ...
parents 1dfb0f47 22b65d31
...@@ -347,81 +347,8 @@ for tickless systems. It follows the same basic strategy as the ``menu`` `one
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
given conditions. However, it applies a different approach to that problem.
.. kernel-doc:: drivers/cpuidle/governors/teo.c
   :doc: teo-description
First, it does not use sleep length correction factors, but instead it attempts
to correlate the observed idle duration values with the available idle states
and use that information to pick up the idle state that is most likely to
"match" the upcoming CPU idle interval. Second, it does not take the tasks
that were running on the given CPU in the past and are waiting on some I/O
operations to complete now at all (there is no guarantee that they will run on
the same CPU when they become runnable again) and the pattern detection code in
it avoids taking timer wakeups into account. It also only uses idle duration
values less than the current time till the closest timer (with the scheduler
tick excluded) for that purpose.
Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
the *sleep length*, which is the time until the closest timer event with the
assumption that the scheduler tick will be stopped (that also is the upper bound
on the time until the next CPU wakeup). That value is then used to preselect an
idle state on the basis of three metrics maintained for each idle state provided
by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
state will "match" the observed (post-wakeup) idle duration if it "matches" the
sleep length. They both are subject to decay (after a CPU wakeup) every time
the target residency of the idle state corresponding to them is less than or
equal to the sleep length and the target residency of the next idle state is
greater than the sleep length (that is, when the idle state corresponding to
them "matches" the sleep length). The ``hits`` metric is increased if the
former condition is satisfied and the target residency of the given idle state
is less than or equal to the observed idle duration and the target residency of
the next idle state is greater than the observed idle duration at the same time
(that is, it is increased when the given idle state "matches" both the sleep
length and the observed idle duration). In turn, the ``misses`` metric is
increased when the given idle state "matches" the sleep length only and the
observed idle duration is too short for its target residency.
The ``early_hits`` metric measures the likelihood that a given idle state will
"match" the observed (post-wakeup) idle duration if it does not "match" the
sleep length. It is subject to decay on every CPU wakeup and it is increased
when the idle state corresponding to it "matches" the observed (post-wakeup)
idle duration and the target residency of the next idle state is less than or
equal to the sleep length (i.e. the idle state "matching" the sleep length is
deeper than the given one).
The governor walks the list of idle states provided by the ``CPUIdle`` driver
and finds the last (deepest) one with the target residency less than or equal
to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
state are compared with each other and it is preselected if the ``hits`` one is
greater (which means that that idle state is likely to "match" the observed idle
duration after CPU wakeup). If the ``misses`` one is greater, the governor
preselects the shallower idle state with the maximum ``early_hits`` metric
(or if there are multiple shallower idle states with equal ``early_hits``
metric which also is the maximum, the shallowest of them will be preselected).
[If there is a wakeup latency constraint coming from the `PM QoS framework
<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
target residency within the sleep length, the deepest idle state with the exit
latency within the constraint is preselected without consulting the ``hits``,
``misses`` and ``early_hits`` metrics.]
Next, the governor takes several idle duration values observed most recently
into consideration and if at least a half of them are greater than or equal to
the target residency of the preselected idle state, that idle state becomes the
final candidate to ask for. Otherwise, the average of the most recent idle
duration values below the target residency of the preselected idle state is
computed and the governor walks the idle states shallower than the preselected
one and finds the deepest of them with the target residency within that average.
That idle state is then taken as the final candidate to ask for.
Still, at this point the governor may need to refine the idle state selection if
it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
generally happens if the target residency of the idle state selected so far is
less than the tick period and the tick has not been stopped already (in a
previous iteration of the idle loop). Then, like in the ``menu`` governor
`case <menu-gov_>`_, the sleep length used in the previous computations may not
reflect the real time until the closest timer event and if it really is greater
than that time, a shallower state with a suitable target residency may need to
be selected.
.. _idle-states-representation:
......
...@@ -365,6 +365,9 @@ argument is passed to the kernel in the command line.
inclusive) including both turbo and non-turbo P-states (see
`Turbo P-states Support`_).

This attribute is present only if the value exposed by it is the same
for all of the CPUs in the system.

The value of this attribute is not affected by the ``no_turbo``
setting described `below <no_turbo_attr_>`_.
...@@ -374,6 +377,9 @@ argument is passed to the kernel in the command line.
Ratio of the `turbo range <turbo_>`_ size to the size of the entire
range of supported P-states, in percent.

This attribute is present only if the value exposed by it is the same
for all of the CPUs in the system.

This attribute is read-only.
......
NVIDIA Tegra Activity Monitor
The activity monitor block collects statistics about the behaviour of other
components in the system. This information can be used to derive the rate at
which the external memory needs to be clocked in order to serve all requests
from the monitored clients.
Required properties:
- compatible: should be "nvidia,tegra<chip>-actmon"
- reg: offset and length of the register set for the device
- interrupts: standard interrupt property
- clocks: Must contain a phandle and clock specifier pair for each entry in
clock-names. See ../../clock/clock-bindings.txt for details.
- clock-names: Must include the following entries:
- actmon
- emc
- resets: Must contain an entry for each entry in reset-names. See
../../reset/reset.txt for details.
- reset-names: Must include the following entries:
- actmon
- operating-points-v2: See ../bindings/opp/opp.txt for details.
- interconnects: Should contain entries for memory clients sitting on
MC->EMC memory interconnect path.
- interconnect-names: Should include name of the interconnect path for each
interconnect entry. Consult TRM documentation for
information about available memory clients, see MEMORY
CONTROLLER section.
For each opp entry in 'operating-points-v2' table:
- opp-supported-hw: bitfield indicating SoC speedo ID mask
- opp-peak-kBps: peak bandwidth of the memory channel
Example:
dfs_opp_table: opp-table {
compatible = "operating-points-v2";
opp@12750000 {
opp-hz = /bits/ 64 <12750000>;
opp-supported-hw = <0x000F>;
opp-peak-kBps = <51000>;
};
...
};
actmon@6000c800 {
compatible = "nvidia,tegra124-actmon";
reg = <0x0 0x6000c800 0x0 0x400>;
interrupts = <GIC_SPI 45 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&tegra_car TEGRA124_CLK_ACTMON>,
<&tegra_car TEGRA124_CLK_EMC>;
clock-names = "actmon", "emc";
resets = <&tegra_car 119>;
reset-names = "actmon";
operating-points-v2 = <&dfs_opp_table>;
interconnects = <&mc TEGRA124_MC_MPCORER &emc>;
interconnect-names = "cpu";
};
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/devfreq/nvidia,tegra30-actmon.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: NVIDIA Tegra30 Activity Monitor
maintainers:
- Dmitry Osipenko <digetx@gmail.com>
- Jon Hunter <jonathanh@nvidia.com>
- Thierry Reding <thierry.reding@gmail.com>
description: |
The activity monitor block collects statistics about the behaviour of other
components in the system. This information can be used to derive the rate at
which the external memory needs to be clocked in order to serve all requests
from the monitored clients.
properties:
compatible:
enum:
- nvidia,tegra30-actmon
- nvidia,tegra114-actmon
- nvidia,tegra124-actmon
- nvidia,tegra210-actmon
reg:
maxItems: 1
clocks:
maxItems: 2
clock-names:
items:
- const: actmon
- const: emc
resets:
maxItems: 1
reset-names:
items:
- const: actmon
interrupts:
maxItems: 1
interconnects:
minItems: 1
maxItems: 12
interconnect-names:
minItems: 1
maxItems: 12
description:
Should include name of the interconnect path for each interconnect
entry. Consult TRM documentation for information about available
memory clients, see MEMORY CONTROLLER and ACTIVITY MONITOR sections.
operating-points-v2:
description:
Should contain freqs and voltages and opp-supported-hw property, which
is a bitfield indicating SoC speedo ID mask.
"#cooling-cells":
const: 2
required:
- compatible
- reg
- clocks
- clock-names
- resets
- reset-names
- interrupts
- interconnects
- interconnect-names
- operating-points-v2
- "#cooling-cells"
additionalProperties: false
examples:
- |
#include <dt-bindings/memory/tegra30-mc.h>
mc: memory-controller@7000f000 {
compatible = "nvidia,tegra30-mc";
reg = <0x7000f000 0x400>;
clocks = <&clk 32>;
clock-names = "mc";
interrupts = <0 77 4>;
#iommu-cells = <1>;
#reset-cells = <1>;
#interconnect-cells = <1>;
};
emc: external-memory-controller@7000f400 {
compatible = "nvidia,tegra30-emc";
reg = <0x7000f400 0x400>;
interrupts = <0 78 4>;
clocks = <&clk 57>;
nvidia,memory-controller = <&mc>;
operating-points-v2 = <&dvfs_opp_table>;
power-domains = <&domain>;
#interconnect-cells = <0>;
};
actmon@6000c800 {
compatible = "nvidia,tegra30-actmon";
reg = <0x6000c800 0x400>;
interrupts = <0 45 4>;
clocks = <&clk 119>, <&clk 57>;
clock-names = "actmon", "emc";
resets = <&rst 119>;
reset-names = "actmon";
operating-points-v2 = <&dvfs_opp_table>;
interconnects = <&mc TEGRA30_MC_MPCORER &emc>;
interconnect-names = "cpu-read";
#cooling-cells = <2>;
};
...@@ -378,7 +378,11 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
`int pm_runtime_get_sync(struct device *dev);`
- increment the device's usage counter, run pm_runtime_resume(dev) and
return its result;
note that it does not drop the device's usage counter on errors, so
consider using pm_runtime_resume_and_get() instead of it, especially
if its return value is checked by the caller, as this is likely to
result in cleaner code.
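
A minimal sketch of the recommended pattern, assuming a hypothetical driver
helper foo_do_transfer(), could look like this:

	#include <linux/device.h>
	#include <linux/pm_runtime.h>

	static int foo_do_transfer(struct device *dev)
	{
		int ret;

		/*
		 * pm_runtime_resume_and_get() drops the usage counter again
		 * if the resume fails, so a plain return is enough on errors.
		 */
		ret = pm_runtime_resume_and_get(dev);
		if (ret < 0)
			return ret;

		/* ... access the (hypothetical) hardware here ... */

		pm_runtime_put(dev);
		return 0;
	}
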
`int pm_runtime_get_if_in_use(struct device *dev);`
- return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the
...@@ -827,6 +831,15 @@ or driver about runtime power changes. Instead, the driver for the device's
parent must take responsibility for telling the device's driver when the
parent's power state changes.

Note that, in some cases it may not be desirable for subsystems/drivers to call
pm_runtime_no_callbacks() for their devices. This could be because a subset of
the runtime PM callbacks needs to be implemented, a platform dependent PM
domain could get attached to the device or that the device is power managed
through a supplier device link. For these reasons and to avoid boilerplate code
in subsystems/drivers, the PM core allows runtime PM callbacks to be
unassigned. More precisely, if a callback pointer is NULL, the PM core will act
as though there was a callback and it returned 0.
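
For instance, a hypothetical driver that only needs to restore hardware state
on resume might leave its ->runtime_suspend pointer unassigned along these
lines (a sketch, not taken from any real driver):

	#include <linux/device.h>
	#include <linux/pm.h>
	#include <linux/pm_runtime.h>

	static int foo_runtime_resume(struct device *dev)
	{
		/* Re-program the (hypothetical) hardware after a power-up. */
		return 0;
	}

	static const struct dev_pm_ops foo_pm_ops = {
		/* No ->runtime_suspend: the PM core treats it as returning 0. */
		SET_RUNTIME_PM_OPS(NULL, foo_runtime_resume, NULL)
	};
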
9. Autosuspend, or automatically-delayed suspends
=================================================
......
...@@ -379,6 +379,44 @@ static int _genpd_set_performance_state(struct generic_pm_domain *genpd, ...@@ -379,6 +379,44 @@ static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
return ret; return ret;
} }
static int genpd_set_performance_state(struct device *dev, unsigned int state)
{
struct generic_pm_domain *genpd = dev_to_genpd(dev);
struct generic_pm_domain_data *gpd_data = dev_gpd_data(dev);
unsigned int prev_state;
int ret;
prev_state = gpd_data->performance_state;
if (prev_state == state)
return 0;
gpd_data->performance_state = state;
state = _genpd_reeval_performance_state(genpd, state);
ret = _genpd_set_performance_state(genpd, state, 0);
if (ret)
gpd_data->performance_state = prev_state;
return ret;
}
static int genpd_drop_performance_state(struct device *dev)
{
unsigned int prev_state = dev_gpd_data(dev)->performance_state;
if (!genpd_set_performance_state(dev, 0))
return prev_state;
return 0;
}
static void genpd_restore_performance_state(struct device *dev,
unsigned int state)
{
if (state)
genpd_set_performance_state(dev, state);
}
/** /**
* dev_pm_genpd_set_performance_state- Set performance state of device's power * dev_pm_genpd_set_performance_state- Set performance state of device's power
* domain. * domain.
...@@ -397,8 +435,6 @@ static int _genpd_set_performance_state(struct generic_pm_domain *genpd, ...@@ -397,8 +435,6 @@ static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state) int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state)
{ {
struct generic_pm_domain *genpd; struct generic_pm_domain *genpd;
struct generic_pm_domain_data *gpd_data;
unsigned int prev;
int ret; int ret;
genpd = dev_to_genpd_safe(dev); genpd = dev_to_genpd_safe(dev);
...@@ -410,16 +446,7 @@ int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state) ...@@ -410,16 +446,7 @@ int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state)
return -EINVAL; return -EINVAL;
genpd_lock(genpd); genpd_lock(genpd);
ret = genpd_set_performance_state(dev, state);
gpd_data = to_gpd_data(dev->power.subsys_data->domain_data);
prev = gpd_data->performance_state;
gpd_data->performance_state = state;
state = _genpd_reeval_performance_state(genpd, state);
ret = _genpd_set_performance_state(genpd, state, 0);
if (ret)
gpd_data->performance_state = prev;
genpd_unlock(genpd); genpd_unlock(genpd);
return ret; return ret;
...@@ -572,6 +599,7 @@ static void genpd_queue_power_off_work(struct generic_pm_domain *genpd) ...@@ -572,6 +599,7 @@ static void genpd_queue_power_off_work(struct generic_pm_domain *genpd)
* RPM status of the releated device is in an intermediate state, not yet turned * RPM status of the releated device is in an intermediate state, not yet turned
* into RPM_SUSPENDED. This means genpd_power_off() must allow one device to not * into RPM_SUSPENDED. This means genpd_power_off() must allow one device to not
* be RPM_SUSPENDED, while it tries to power off the PM domain. * be RPM_SUSPENDED, while it tries to power off the PM domain.
* @depth: nesting count for lockdep.
* *
* If all of the @genpd's devices have been suspended and all of its subdomains * If all of the @genpd's devices have been suspended and all of its subdomains
* have been powered down, remove power from @genpd. * have been powered down, remove power from @genpd.
...@@ -832,7 +860,8 @@ static int genpd_runtime_suspend(struct device *dev) ...@@ -832,7 +860,8 @@ static int genpd_runtime_suspend(struct device *dev)
{ {
struct generic_pm_domain *genpd; struct generic_pm_domain *genpd;
bool (*suspend_ok)(struct device *__dev); bool (*suspend_ok)(struct device *__dev);
struct gpd_timing_data *td = &dev_gpd_data(dev)->td; struct generic_pm_domain_data *gpd_data = dev_gpd_data(dev);
struct gpd_timing_data *td = &gpd_data->td;
bool runtime_pm = pm_runtime_enabled(dev); bool runtime_pm = pm_runtime_enabled(dev);
ktime_t time_start; ktime_t time_start;
s64 elapsed_ns; s64 elapsed_ns;
...@@ -889,6 +918,7 @@ static int genpd_runtime_suspend(struct device *dev) ...@@ -889,6 +918,7 @@ static int genpd_runtime_suspend(struct device *dev)
return 0; return 0;
genpd_lock(genpd); genpd_lock(genpd);
gpd_data->rpm_pstate = genpd_drop_performance_state(dev);
genpd_power_off(genpd, true, 0); genpd_power_off(genpd, true, 0);
genpd_unlock(genpd); genpd_unlock(genpd);
...@@ -906,7 +936,8 @@ static int genpd_runtime_suspend(struct device *dev) ...@@ -906,7 +936,8 @@ static int genpd_runtime_suspend(struct device *dev)
static int genpd_runtime_resume(struct device *dev) static int genpd_runtime_resume(struct device *dev)
{ {
struct generic_pm_domain *genpd; struct generic_pm_domain *genpd;
struct gpd_timing_data *td = &dev_gpd_data(dev)->td; struct generic_pm_domain_data *gpd_data = dev_gpd_data(dev);
struct gpd_timing_data *td = &gpd_data->td;
bool runtime_pm = pm_runtime_enabled(dev); bool runtime_pm = pm_runtime_enabled(dev);
ktime_t time_start; ktime_t time_start;
s64 elapsed_ns; s64 elapsed_ns;
...@@ -930,6 +961,8 @@ static int genpd_runtime_resume(struct device *dev) ...@@ -930,6 +961,8 @@ static int genpd_runtime_resume(struct device *dev)
genpd_lock(genpd); genpd_lock(genpd);
ret = genpd_power_on(genpd, 0); ret = genpd_power_on(genpd, 0);
if (!ret)
genpd_restore_performance_state(dev, gpd_data->rpm_pstate);
genpd_unlock(genpd); genpd_unlock(genpd);
if (ret) if (ret)
...@@ -968,6 +1001,7 @@ static int genpd_runtime_resume(struct device *dev) ...@@ -968,6 +1001,7 @@ static int genpd_runtime_resume(struct device *dev)
err_poweroff: err_poweroff:
if (!pm_runtime_is_irq_safe(dev) || genpd_is_irq_safe(genpd)) { if (!pm_runtime_is_irq_safe(dev) || genpd_is_irq_safe(genpd)) {
genpd_lock(genpd); genpd_lock(genpd);
gpd_data->rpm_pstate = genpd_drop_performance_state(dev);
genpd_power_off(genpd, true, 0); genpd_power_off(genpd, true, 0);
genpd_unlock(genpd); genpd_unlock(genpd);
} }
...@@ -2505,7 +2539,7 @@ EXPORT_SYMBOL_GPL(of_genpd_remove_subdomain); ...@@ -2505,7 +2539,7 @@ EXPORT_SYMBOL_GPL(of_genpd_remove_subdomain);
/** /**
* of_genpd_remove_last - Remove the last PM domain registered for a provider * of_genpd_remove_last - Remove the last PM domain registered for a provider
* @provider: Pointer to device structure associated with provider * @np: Pointer to device node associated with provider
* *
* Find the last PM domain that was added by a particular provider and * Find the last PM domain that was added by a particular provider and
* remove this PM domain from the list of PM domains. The provider is * remove this PM domain from the list of PM domains. The provider is
......
...@@ -252,6 +252,7 @@ static bool __default_power_down_ok(struct dev_pm_domain *pd, ...@@ -252,6 +252,7 @@ static bool __default_power_down_ok(struct dev_pm_domain *pd,
/** /**
* _default_power_down_ok - Default generic PM domain power off governor routine. * _default_power_down_ok - Default generic PM domain power off governor routine.
* @pd: PM domain to check. * @pd: PM domain to check.
* @now: current ktime.
* *
* This routine must be executed under the PM domain's lock. * This routine must be executed under the PM domain's lock.
*/ */
......
...@@ -345,7 +345,7 @@ static void rpm_suspend_suppliers(struct device *dev) ...@@ -345,7 +345,7 @@ static void rpm_suspend_suppliers(struct device *dev)
static int __rpm_callback(int (*cb)(struct device *), struct device *dev) static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
__releases(&dev->power.lock) __acquires(&dev->power.lock) __releases(&dev->power.lock) __acquires(&dev->power.lock)
{ {
int retval, idx; int retval = 0, idx;
bool use_links = dev->power.links_count > 0; bool use_links = dev->power.links_count > 0;
if (dev->power.irq_safe) { if (dev->power.irq_safe) {
...@@ -373,6 +373,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev) ...@@ -373,6 +373,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
} }
} }
if (cb)
retval = cb(dev); retval = cb(dev);
if (dev->power.irq_safe) { if (dev->power.irq_safe) {
...@@ -446,7 +447,10 @@ static int rpm_idle(struct device *dev, int rpmflags) ...@@ -446,7 +447,10 @@ static int rpm_idle(struct device *dev, int rpmflags)
/* Pending requests need to be canceled. */ /* Pending requests need to be canceled. */
dev->power.request = RPM_REQ_NONE; dev->power.request = RPM_REQ_NONE;
if (dev->power.no_callbacks) callback = RPM_GET_CALLBACK(dev, runtime_idle);
/* If no callback assume success. */
if (!callback || dev->power.no_callbacks)
goto out; goto out;
/* Carry out an asynchronous or a synchronous idle notification. */ /* Carry out an asynchronous or a synchronous idle notification. */
...@@ -462,9 +466,6 @@ static int rpm_idle(struct device *dev, int rpmflags) ...@@ -462,9 +466,6 @@ static int rpm_idle(struct device *dev, int rpmflags)
dev->power.idle_notification = true; dev->power.idle_notification = true;
callback = RPM_GET_CALLBACK(dev, runtime_idle);
if (callback)
retval = __rpm_callback(callback, dev); retval = __rpm_callback(callback, dev);
dev->power.idle_notification = false; dev->power.idle_notification = false;
...@@ -484,9 +485,6 @@ static int rpm_callback(int (*cb)(struct device *), struct device *dev) ...@@ -484,9 +485,6 @@ static int rpm_callback(int (*cb)(struct device *), struct device *dev)
{ {
int retval; int retval;
if (!cb)
return -ENOSYS;
if (dev->power.memalloc_noio) { if (dev->power.memalloc_noio) {
unsigned int noio_flag; unsigned int noio_flag;
......
...@@ -182,7 +182,6 @@ int dev_pm_set_dedicated_wake_irq(struct device *dev, int irq) ...@@ -182,7 +182,6 @@ int dev_pm_set_dedicated_wake_irq(struct device *dev, int irq)
wirq->dev = dev; wirq->dev = dev;
wirq->irq = irq; wirq->irq = irq;
irq_set_status_flags(irq, IRQ_NOAUTOEN);
/* Prevent deferred spurious wakeirqs with disable_irq_nosync() */ /* Prevent deferred spurious wakeirqs with disable_irq_nosync() */
irq_set_status_flags(irq, IRQ_DISABLE_UNLAZY); irq_set_status_flags(irq, IRQ_DISABLE_UNLAZY);
...@@ -192,7 +191,8 @@ int dev_pm_set_dedicated_wake_irq(struct device *dev, int irq) ...@@ -192,7 +191,8 @@ int dev_pm_set_dedicated_wake_irq(struct device *dev, int irq)
* so we use a threaded irq. * so we use a threaded irq.
*/ */
err = request_threaded_irq(irq, NULL, handle_threaded_wake_irq, err = request_threaded_irq(irq, NULL, handle_threaded_wake_irq,
IRQF_ONESHOT, wirq->name, wirq); IRQF_ONESHOT | IRQF_NO_AUTOEN,
wirq->name, wirq);
if (err) if (err)
goto err_free_name; goto err_free_name;
......
...@@ -1367,9 +1367,14 @@ static int cpufreq_online(unsigned int cpu) ...@@ -1367,9 +1367,14 @@ static int cpufreq_online(unsigned int cpu)
goto out_free_policy; goto out_free_policy;
} }
/*
* The initialization has succeeded and the policy is online.
* If there is a problem with its frequency table, take it
* offline and drop it.
*/
ret = cpufreq_table_validate_and_sort(policy); ret = cpufreq_table_validate_and_sort(policy);
if (ret) if (ret)
goto out_exit_policy; goto out_offline_policy;
/* related_cpus should at least include policy->cpus. */ /* related_cpus should at least include policy->cpus. */
cpumask_copy(policy->related_cpus, policy->cpus); cpumask_copy(policy->related_cpus, policy->cpus);
...@@ -1515,6 +1520,10 @@ static int cpufreq_online(unsigned int cpu) ...@@ -1515,6 +1520,10 @@ static int cpufreq_online(unsigned int cpu)
up_write(&policy->rwsem); up_write(&policy->rwsem);
out_offline_policy:
if (cpufreq_driver->offline)
cpufreq_driver->offline(policy);
out_exit_policy: out_exit_policy:
if (cpufreq_driver->exit) if (cpufreq_driver->exit)
cpufreq_driver->exit(policy); cpufreq_driver->exit(policy);
......
...@@ -211,7 +211,7 @@ void cpufreq_stats_free_table(struct cpufreq_policy *policy) ...@@ -211,7 +211,7 @@ void cpufreq_stats_free_table(struct cpufreq_policy *policy)
void cpufreq_stats_create_table(struct cpufreq_policy *policy) void cpufreq_stats_create_table(struct cpufreq_policy *policy)
{ {
unsigned int i = 0, count = 0, ret = -ENOMEM; unsigned int i = 0, count;
struct cpufreq_stats *stats; struct cpufreq_stats *stats;
unsigned int alloc_size; unsigned int alloc_size;
struct cpufreq_frequency_table *pos; struct cpufreq_frequency_table *pos;
...@@ -253,8 +253,7 @@ void cpufreq_stats_create_table(struct cpufreq_policy *policy) ...@@ -253,8 +253,7 @@ void cpufreq_stats_create_table(struct cpufreq_policy *policy)
stats->last_index = freq_table_get_index(stats, policy->cur); stats->last_index = freq_table_get_index(stats, policy->cur);
policy->stats = stats; policy->stats = stats;
ret = sysfs_create_group(&policy->kobj, &stats_attr_group); if (!sysfs_create_group(&policy->kobj, &stats_attr_group))
if (!ret)
return; return;
/* We failed, release resources */ /* We failed, release resources */
......
...@@ -121,9 +121,10 @@ struct sample { ...@@ -121,9 +121,10 @@ struct sample {
* @max_pstate_physical:This is physical Max P state for a processor * @max_pstate_physical:This is physical Max P state for a processor
* This can be higher than the max_pstate which can * This can be higher than the max_pstate which can
* be limited by platform thermal design power limits * be limited by platform thermal design power limits
* @scaling: Scaling factor to convert frequency to cpufreq * @perf_ctl_scaling: PERF_CTL P-state to frequency scaling factor
* frequency units * @scaling: Scaling factor between performance and frequency
* @turbo_pstate: Max Turbo P state possible for this platform * @turbo_pstate: Max Turbo P state possible for this platform
* @min_freq: @min_pstate frequency in cpufreq units
* @max_freq: @max_pstate frequency in cpufreq units * @max_freq: @max_pstate frequency in cpufreq units
* @turbo_freq: @turbo_pstate frequency in cpufreq units * @turbo_freq: @turbo_pstate frequency in cpufreq units
* *
...@@ -134,8 +135,10 @@ struct pstate_data { ...@@ -134,8 +135,10 @@ struct pstate_data {
int min_pstate; int min_pstate;
int max_pstate; int max_pstate;
int max_pstate_physical; int max_pstate_physical;
int perf_ctl_scaling;
int scaling; int scaling;
int turbo_pstate; int turbo_pstate;
unsigned int min_freq;
unsigned int max_freq; unsigned int max_freq;
unsigned int turbo_freq; unsigned int turbo_freq;
}; };
...@@ -366,7 +369,7 @@ static void intel_pstate_set_itmt_prio(int cpu) ...@@ -366,7 +369,7 @@ static void intel_pstate_set_itmt_prio(int cpu)
} }
} }
static int intel_pstate_get_cppc_guranteed(int cpu) static int intel_pstate_get_cppc_guaranteed(int cpu)
{ {
struct cppc_perf_caps cppc_perf; struct cppc_perf_caps cppc_perf;
int ret; int ret;
...@@ -382,7 +385,7 @@ static int intel_pstate_get_cppc_guranteed(int cpu) ...@@ -382,7 +385,7 @@ static int intel_pstate_get_cppc_guranteed(int cpu)
} }
#else /* CONFIG_ACPI_CPPC_LIB */ #else /* CONFIG_ACPI_CPPC_LIB */
static void intel_pstate_set_itmt_prio(int cpu) static inline void intel_pstate_set_itmt_prio(int cpu)
{ {
} }
#endif /* CONFIG_ACPI_CPPC_LIB */ #endif /* CONFIG_ACPI_CPPC_LIB */
...@@ -467,6 +470,20 @@ static void intel_pstate_exit_perf_limits(struct cpufreq_policy *policy) ...@@ -467,6 +470,20 @@ static void intel_pstate_exit_perf_limits(struct cpufreq_policy *policy)
acpi_processor_unregister_performance(policy->cpu); acpi_processor_unregister_performance(policy->cpu);
} }
static bool intel_pstate_cppc_perf_valid(u32 perf, struct cppc_perf_caps *caps)
{
return perf && perf <= caps->highest_perf && perf >= caps->lowest_perf;
}
static bool intel_pstate_cppc_perf_caps(struct cpudata *cpu,
struct cppc_perf_caps *caps)
{
if (cppc_get_perf_caps(cpu->cpu, caps))
return false;
return caps->highest_perf && caps->lowest_perf <= caps->highest_perf;
}
#else /* CONFIG_ACPI */ #else /* CONFIG_ACPI */
static inline void intel_pstate_init_acpi_perf_limits(struct cpufreq_policy *policy) static inline void intel_pstate_init_acpi_perf_limits(struct cpufreq_policy *policy)
{ {
...@@ -483,12 +500,146 @@ static inline bool intel_pstate_acpi_pm_profile_server(void) ...@@ -483,12 +500,146 @@ static inline bool intel_pstate_acpi_pm_profile_server(void)
#endif /* CONFIG_ACPI */ #endif /* CONFIG_ACPI */
#ifndef CONFIG_ACPI_CPPC_LIB #ifndef CONFIG_ACPI_CPPC_LIB
static int intel_pstate_get_cppc_guranteed(int cpu) static inline int intel_pstate_get_cppc_guaranteed(int cpu)
{ {
return -ENOTSUPP; return -ENOTSUPP;
} }
#endif /* CONFIG_ACPI_CPPC_LIB */ #endif /* CONFIG_ACPI_CPPC_LIB */
static void intel_pstate_hybrid_hwp_perf_ctl_parity(struct cpudata *cpu)
{
pr_debug("CPU%d: Using PERF_CTL scaling for HWP\n", cpu->cpu);
cpu->pstate.scaling = cpu->pstate.perf_ctl_scaling;
}
/**
* intel_pstate_hybrid_hwp_calibrate - Calibrate HWP performance levels.
* @cpu: Target CPU.
*
* On hybrid processors, HWP may expose more performance levels than there are
* P-states accessible through the PERF_CTL interface. If that happens, the
* scaling factor between HWP performance levels and CPU frequency will be less
* than the scaling factor between P-state values and CPU frequency.
*
* In that case, the scaling factor between HWP performance levels and CPU
* frequency needs to be determined which can be done with the help of the
* observation that certain HWP performance levels should correspond to certain
* P-states, like for example the HWP highest performance should correspond
* to the maximum turbo P-state of the CPU.
*/
static void intel_pstate_hybrid_hwp_calibrate(struct cpudata *cpu)
{
int perf_ctl_max_phys = cpu->pstate.max_pstate_physical;
int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling;
int perf_ctl_turbo = pstate_funcs.get_turbo();
int turbo_freq = perf_ctl_turbo * perf_ctl_scaling;
int perf_ctl_max = pstate_funcs.get_max();
int max_freq = perf_ctl_max * perf_ctl_scaling;
int scaling = INT_MAX;
int freq;
pr_debug("CPU%d: perf_ctl_max_phys = %d\n", cpu->cpu, perf_ctl_max_phys);
pr_debug("CPU%d: perf_ctl_max = %d\n", cpu->cpu, perf_ctl_max);
pr_debug("CPU%d: perf_ctl_turbo = %d\n", cpu->cpu, perf_ctl_turbo);
pr_debug("CPU%d: perf_ctl_scaling = %d\n", cpu->cpu, perf_ctl_scaling);
pr_debug("CPU%d: HWP_CAP guaranteed = %d\n", cpu->cpu, cpu->pstate.max_pstate);
pr_debug("CPU%d: HWP_CAP highest = %d\n", cpu->cpu, cpu->pstate.turbo_pstate);
#ifdef CONFIG_ACPI
if (IS_ENABLED(CONFIG_ACPI_CPPC_LIB)) {
struct cppc_perf_caps caps;
if (intel_pstate_cppc_perf_caps(cpu, &caps)) {
if (intel_pstate_cppc_perf_valid(caps.nominal_perf, &caps)) {
pr_debug("CPU%d: Using CPPC nominal\n", cpu->cpu);
/*
* If the CPPC nominal performance is valid, it
* can be assumed to correspond to cpu_khz.
*/
if (caps.nominal_perf == perf_ctl_max_phys) {
intel_pstate_hybrid_hwp_perf_ctl_parity(cpu);
return;
}
scaling = DIV_ROUND_UP(cpu_khz, caps.nominal_perf);
} else if (intel_pstate_cppc_perf_valid(caps.guaranteed_perf, &caps)) {
pr_debug("CPU%d: Using CPPC guaranteed\n", cpu->cpu);
/*
* If the CPPC guaranteed performance is valid,
* it can be assumed to correspond to max_freq.
*/
if (caps.guaranteed_perf == perf_ctl_max) {
intel_pstate_hybrid_hwp_perf_ctl_parity(cpu);
return;
}
scaling = DIV_ROUND_UP(max_freq, caps.guaranteed_perf);
}
}
}
#endif
/*
* If using the CPPC data to compute the HWP-to-frequency scaling factor
* doesn't work, use the HWP_CAP gauranteed perf for this purpose with
* the assumption that it corresponds to max_freq.
*/
if (scaling > perf_ctl_scaling) {
pr_debug("CPU%d: Using HWP_CAP guaranteed\n", cpu->cpu);
if (cpu->pstate.max_pstate == perf_ctl_max) {
intel_pstate_hybrid_hwp_perf_ctl_parity(cpu);
return;
}
scaling = DIV_ROUND_UP(max_freq, cpu->pstate.max_pstate);
if (scaling > perf_ctl_scaling) {
/*
* This should not happen, because it would mean that
* the number of HWP perf levels was less than the
* number of P-states, so use the PERF_CTL scaling in
* that case.
*/
pr_debug("CPU%d: scaling (%d) out of range\n", cpu->cpu,
scaling);
intel_pstate_hybrid_hwp_perf_ctl_parity(cpu);
return;
}
}
/*
* If the product of the HWP performance scaling factor obtained above
* and the HWP_CAP highest performance is greater than the maximum turbo
* frequency corresponding to the pstate_funcs.get_turbo() return value,
* the scaling factor is too high, so recompute it so that the HWP_CAP
* highest performance corresponds to the maximum turbo frequency.
*/
if (turbo_freq < cpu->pstate.turbo_pstate * scaling) {
pr_debug("CPU%d: scaling too high (%d)\n", cpu->cpu, scaling);
cpu->pstate.turbo_freq = turbo_freq;
scaling = DIV_ROUND_UP(turbo_freq, cpu->pstate.turbo_pstate);
}
cpu->pstate.scaling = scaling;
pr_debug("CPU%d: HWP-to-frequency scaling factor: %d\n", cpu->cpu, scaling);
cpu->pstate.max_freq = rounddown(cpu->pstate.max_pstate * scaling,
perf_ctl_scaling);
freq = perf_ctl_max_phys * perf_ctl_scaling;
cpu->pstate.max_pstate_physical = DIV_ROUND_UP(freq, scaling);
cpu->pstate.min_freq = cpu->pstate.min_pstate * perf_ctl_scaling;
/*
* Cast the min P-state value retrieved via pstate_funcs.get_min() to
* the effective range of HWP performance levels.
*/
cpu->pstate.min_pstate = DIV_ROUND_UP(cpu->pstate.min_freq, scaling);
}
static inline void update_turbo_state(void) static inline void update_turbo_state(void)
{ {
u64 misc_en; u64 misc_en;
...@@ -795,19 +946,22 @@ cpufreq_freq_attr_rw(energy_performance_preference); ...@@ -795,19 +946,22 @@ cpufreq_freq_attr_rw(energy_performance_preference);
static ssize_t show_base_frequency(struct cpufreq_policy *policy, char *buf) static ssize_t show_base_frequency(struct cpufreq_policy *policy, char *buf)
{ {
struct cpudata *cpu; struct cpudata *cpu = all_cpu_data[policy->cpu];
u64 cap; int ratio, freq;
int ratio;
ratio = intel_pstate_get_cppc_guranteed(policy->cpu); ratio = intel_pstate_get_cppc_guaranteed(policy->cpu);
if (ratio <= 0) { if (ratio <= 0) {
u64 cap;
rdmsrl_on_cpu(policy->cpu, MSR_HWP_CAPABILITIES, &cap); rdmsrl_on_cpu(policy->cpu, MSR_HWP_CAPABILITIES, &cap);
ratio = HWP_GUARANTEED_PERF(cap); ratio = HWP_GUARANTEED_PERF(cap);
} }
cpu = all_cpu_data[policy->cpu]; freq = ratio * cpu->pstate.scaling;
if (cpu->pstate.scaling != cpu->pstate.perf_ctl_scaling)
freq = rounddown(freq, cpu->pstate.perf_ctl_scaling);
return sprintf(buf, "%d\n", ratio * cpu->pstate.scaling); return sprintf(buf, "%d\n", freq);
} }
cpufreq_freq_attr_ro(base_frequency); cpufreq_freq_attr_ro(base_frequency);
...@@ -831,9 +985,20 @@ static void __intel_pstate_get_hwp_cap(struct cpudata *cpu) ...@@ -831,9 +985,20 @@ static void __intel_pstate_get_hwp_cap(struct cpudata *cpu)
static void intel_pstate_get_hwp_cap(struct cpudata *cpu) static void intel_pstate_get_hwp_cap(struct cpudata *cpu)
{ {
int scaling = cpu->pstate.scaling;
__intel_pstate_get_hwp_cap(cpu); __intel_pstate_get_hwp_cap(cpu);
cpu->pstate.max_freq = cpu->pstate.max_pstate * cpu->pstate.scaling;
cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * cpu->pstate.scaling; cpu->pstate.max_freq = cpu->pstate.max_pstate * scaling;
cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * scaling;
if (scaling != cpu->pstate.perf_ctl_scaling) {
int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling;
cpu->pstate.max_freq = rounddown(cpu->pstate.max_freq,
perf_ctl_scaling);
cpu->pstate.turbo_freq = rounddown(cpu->pstate.turbo_freq,
perf_ctl_scaling);
}
} }
static void intel_pstate_hwp_set(unsigned int cpu) static void intel_pstate_hwp_set(unsigned int cpu)
...@@ -1365,8 +1530,6 @@ define_one_global_rw(energy_efficiency); ...@@ -1365,8 +1530,6 @@ define_one_global_rw(energy_efficiency);
static struct attribute *intel_pstate_attributes[] = { static struct attribute *intel_pstate_attributes[] = {
&status.attr, &status.attr,
&no_turbo.attr, &no_turbo.attr,
&turbo_pct.attr,
&num_pstates.attr,
NULL NULL
}; };
...@@ -1391,6 +1554,14 @@ static void __init intel_pstate_sysfs_expose_params(void) ...@@ -1391,6 +1554,14 @@ static void __init intel_pstate_sysfs_expose_params(void)
if (WARN_ON(rc)) if (WARN_ON(rc))
return; return;
if (!boot_cpu_has(X86_FEATURE_HYBRID_CPU)) {
rc = sysfs_create_file(intel_pstate_kobject, &turbo_pct.attr);
WARN_ON(rc);
rc = sysfs_create_file(intel_pstate_kobject, &num_pstates.attr);
WARN_ON(rc);
}
/* /*
* If per cpu limits are enforced there are no global limits, so * If per cpu limits are enforced there are no global limits, so
* return without creating max/min_perf_pct attributes * return without creating max/min_perf_pct attributes
...@@ -1417,6 +1588,11 @@ static void __init intel_pstate_sysfs_remove(void) ...@@ -1417,6 +1588,11 @@ static void __init intel_pstate_sysfs_remove(void)
sysfs_remove_group(intel_pstate_kobject, &intel_pstate_attr_group); sysfs_remove_group(intel_pstate_kobject, &intel_pstate_attr_group);
if (!boot_cpu_has(X86_FEATURE_HYBRID_CPU)) {
sysfs_remove_file(intel_pstate_kobject, &num_pstates.attr);
sysfs_remove_file(intel_pstate_kobject, &turbo_pct.attr);
}
if (!per_cpu_limits) { if (!per_cpu_limits) {
sysfs_remove_file(intel_pstate_kobject, &max_perf_pct.attr); sysfs_remove_file(intel_pstate_kobject, &max_perf_pct.attr);
sysfs_remove_file(intel_pstate_kobject, &min_perf_pct.attr); sysfs_remove_file(intel_pstate_kobject, &min_perf_pct.attr);
...@@ -1713,19 +1889,33 @@ static void intel_pstate_max_within_limits(struct cpudata *cpu) ...@@ -1713,19 +1889,33 @@ static void intel_pstate_max_within_limits(struct cpudata *cpu)
static void intel_pstate_get_cpu_pstates(struct cpudata *cpu) static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
{ {
bool hybrid_cpu = boot_cpu_has(X86_FEATURE_HYBRID_CPU);
int perf_ctl_max_phys = pstate_funcs.get_max_physical();
int perf_ctl_scaling = hybrid_cpu ? cpu_khz / perf_ctl_max_phys :
pstate_funcs.get_scaling();
cpu->pstate.min_pstate = pstate_funcs.get_min(); cpu->pstate.min_pstate = pstate_funcs.get_min();
cpu->pstate.max_pstate_physical = pstate_funcs.get_max_physical(); cpu->pstate.max_pstate_physical = perf_ctl_max_phys;
cpu->pstate.scaling = pstate_funcs.get_scaling(); cpu->pstate.perf_ctl_scaling = perf_ctl_scaling;
if (hwp_active && !hwp_mode_bdw) { if (hwp_active && !hwp_mode_bdw) {
__intel_pstate_get_hwp_cap(cpu); __intel_pstate_get_hwp_cap(cpu);
if (hybrid_cpu)
intel_pstate_hybrid_hwp_calibrate(cpu);
else
cpu->pstate.scaling = perf_ctl_scaling;
} else { } else {
cpu->pstate.scaling = perf_ctl_scaling;
cpu->pstate.max_pstate = pstate_funcs.get_max(); cpu->pstate.max_pstate = pstate_funcs.get_max();
cpu->pstate.turbo_pstate = pstate_funcs.get_turbo(); cpu->pstate.turbo_pstate = pstate_funcs.get_turbo();
} }
cpu->pstate.max_freq = cpu->pstate.max_pstate * cpu->pstate.scaling; if (cpu->pstate.scaling == perf_ctl_scaling) {
cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * cpu->pstate.scaling; cpu->pstate.min_freq = cpu->pstate.min_pstate * perf_ctl_scaling;
cpu->pstate.max_freq = cpu->pstate.max_pstate * perf_ctl_scaling;
cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * perf_ctl_scaling;
}
if (pstate_funcs.get_aperf_mperf_shift) if (pstate_funcs.get_aperf_mperf_shift)
cpu->aperf_mperf_shift = pstate_funcs.get_aperf_mperf_shift(); cpu->aperf_mperf_shift = pstate_funcs.get_aperf_mperf_shift();
...@@ -2087,6 +2277,8 @@ static const struct x86_cpu_id intel_pstate_cpu_ids[] = { ...@@ -2087,6 +2277,8 @@ static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
X86_MATCH(ATOM_GOLDMONT, core_funcs), X86_MATCH(ATOM_GOLDMONT, core_funcs),
X86_MATCH(ATOM_GOLDMONT_PLUS, core_funcs), X86_MATCH(ATOM_GOLDMONT_PLUS, core_funcs),
X86_MATCH(SKYLAKE_X, core_funcs), X86_MATCH(SKYLAKE_X, core_funcs),
X86_MATCH(COMETLAKE, core_funcs),
X86_MATCH(ICELAKE_X, core_funcs),
{} {}
}; };
MODULE_DEVICE_TABLE(x86cpu, intel_pstate_cpu_ids); MODULE_DEVICE_TABLE(x86cpu, intel_pstate_cpu_ids);
...@@ -2195,23 +2387,34 @@ static void intel_pstate_update_perf_limits(struct cpudata *cpu, ...@@ -2195,23 +2387,34 @@ static void intel_pstate_update_perf_limits(struct cpudata *cpu,
unsigned int policy_min, unsigned int policy_min,
unsigned int policy_max) unsigned int policy_max)
{ {
int scaling = cpu->pstate.scaling; int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling;
int32_t max_policy_perf, min_policy_perf; int32_t max_policy_perf, min_policy_perf;
max_policy_perf = policy_max / perf_ctl_scaling;
if (policy_max == policy_min) {
min_policy_perf = max_policy_perf;
} else {
min_policy_perf = policy_min / perf_ctl_scaling;
min_policy_perf = clamp_t(int32_t, min_policy_perf,
0, max_policy_perf);
}
/* /*
* HWP needs some special consideration, because HWP_REQUEST uses * HWP needs some special consideration, because HWP_REQUEST uses
* abstract values to represent performance rather than pure ratios. * abstract values to represent performance rather than pure ratios.
*/ */
if (hwp_active) if (hwp_active) {
intel_pstate_get_hwp_cap(cpu); intel_pstate_get_hwp_cap(cpu);
max_policy_perf = policy_max / scaling; if (cpu->pstate.scaling != perf_ctl_scaling) {
if (policy_max == policy_min) { int scaling = cpu->pstate.scaling;
min_policy_perf = max_policy_perf; int freq;
} else {
min_policy_perf = policy_min / scaling; freq = max_policy_perf * perf_ctl_scaling;
min_policy_perf = clamp_t(int32_t, min_policy_perf, max_policy_perf = DIV_ROUND_UP(freq, scaling);
0, max_policy_perf); freq = min_policy_perf * perf_ctl_scaling;
min_policy_perf = DIV_ROUND_UP(freq, scaling);
}
} }
pr_debug("cpu:%d min_policy_perf:%d max_policy_perf:%d\n", pr_debug("cpu:%d min_policy_perf:%d max_policy_perf:%d\n",
...@@ -2405,7 +2608,7 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy) ...@@ -2405,7 +2608,7 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
cpu->min_perf_ratio = 0; cpu->min_perf_ratio = 0;
/* cpuinfo and default policy values */ /* cpuinfo and default policy values */
policy->cpuinfo.min_freq = cpu->pstate.min_pstate * cpu->pstate.scaling; policy->cpuinfo.min_freq = cpu->pstate.min_freq;
update_turbo_state(); update_turbo_state();
global.turbo_disabled_mf = global.turbo_disabled; global.turbo_disabled_mf = global.turbo_disabled;
policy->cpuinfo.max_freq = global.turbo_disabled ? policy->cpuinfo.max_freq = global.turbo_disabled ?
...@@ -3135,6 +3338,8 @@ static int __init intel_pstate_init(void) ...@@ -3135,6 +3338,8 @@ static int __init intel_pstate_init(void)
} }
pr_info("HWP enabled\n"); pr_info("HWP enabled\n");
} else if (boot_cpu_has(X86_FEATURE_HYBRID_CPU)) {
pr_warn("Problematic setup: Hybrid processor with disabled HWP\n");
} }
return 0; return 0;
......
...@@ -16,7 +16,6 @@ ...@@ -16,7 +16,6 @@
#include <linux/cpufreq.h> #include <linux/cpufreq.h>
#include <linux/module.h> #include <linux/module.h>
#include <linux/err.h> #include <linux/err.h>
#include <linux/sched.h> /* set_cpus_allowed() */
#include <linux/delay.h> #include <linux/delay.h>
#include <linux/platform_device.h> #include <linux/platform_device.h>
......
...@@ -42,6 +42,7 @@ static unsigned int sc520_freq_get_cpu_frequency(unsigned int cpu) ...@@ -42,6 +42,7 @@ static unsigned int sc520_freq_get_cpu_frequency(unsigned int cpu)
default: default:
pr_err("error: cpuctl register has unexpected value %02x\n", pr_err("error: cpuctl register has unexpected value %02x\n",
clockspeed_reg); clockspeed_reg);
fallthrough;
case 0x01: case 0x01:
return 100000; return 100000;
case 0x02: case 0x02:
......
...@@ -23,7 +23,6 @@ ...@@ -23,7 +23,6 @@
#include <linux/cpumask.h> #include <linux/cpumask.h>
#include <linux/cpu.h> #include <linux/cpu.h>
#include <linux/smp.h> #include <linux/smp.h>
#include <linux/sched.h> /* set_cpus_allowed() */
#include <linux/clk.h> #include <linux/clk.h>
#include <linux/percpu.h> #include <linux/percpu.h>
#include <linux/sh_clk.h> #include <linux/sh_clk.h>
......
...@@ -2,47 +2,103 @@ ...@@ -2,47 +2,103 @@
/* /*
* Timer events oriented CPU idle governor * Timer events oriented CPU idle governor
* *
* Copyright (C) 2018 Intel Corporation * Copyright (C) 2018 - 2021 Intel Corporation
* Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*/
/**
* DOC: teo-description
* *
* The idea of this governor is based on the observation that on many systems * The idea of this governor is based on the observation that on many systems
* timer events are two or more orders of magnitude more frequent than any * timer events are two or more orders of magnitude more frequent than any
* other interrupts, so they are likely to be the most significant source of CPU * other interrupts, so they are likely to be the most significant cause of CPU
* wakeups from idle states. Moreover, information about what happened in the * wakeups from idle states. Moreover, information about what happened in the
* (relatively recent) past can be used to estimate whether or not the deepest * (relatively recent) past can be used to estimate whether or not the deepest
* idle state with target residency within the time to the closest timer is * idle state with target residency within the (known) time till the closest
* likely to be suitable for the upcoming idle time of the CPU and, if not, then * timer event, referred to as the sleep length, is likely to be suitable for
* which of the shallower idle states to choose. * the upcoming CPU idle period and, if not, then which of the shallower idle
* states to choose instead of it.
*
* Of course, non-timer wakeup sources are more important in some use cases
* which can be covered by taking a few most recent idle time intervals of the
* CPU into account. However, even in that context it is not necessary to
* consider idle duration values greater than the sleep length, because the
* closest timer will ultimately wake up the CPU anyway unless it is woken up
* earlier.
*
* Thus this governor estimates whether or not the prospective idle duration of
* a CPU is likely to be significantly shorter than the sleep length and selects
* an idle state for it accordingly.
*
* The computations carried out by this governor are based on using bins whose
* boundaries are aligned with the target residency parameter values of the CPU
* idle states provided by the %CPUIdle driver in the ascending order. That is,
* the first bin spans from 0 up to, but not including, the target residency of
* the second idle state (idle state 1), the second bin spans from the target
* residency of idle state 1 up to, but not including, the target residency of
* idle state 2, the third bin spans from the target residency of idle state 2
* up to, but not including, the target residency of idle state 3 and so on.
* The last bin spans from the target residency of the deepest idle state
* supplied by the driver to infinity.
*
* Two metrics called "hits" and "intercepts" are associated with each bin.
* They are updated every time before selecting an idle state for the given CPU
* in accordance with what happened last time.
*
* The "hits" metric reflects the relative frequency of situations in which the
* sleep length and the idle duration measured after CPU wakeup fall into the
* same bin (that is, the CPU appears to wake up "on time" relative to the sleep
* length). In turn, the "intercepts" metric reflects the relative frequency of
* situations in which the measured idle duration is so much shorter than the
* sleep length that the bin it falls into corresponds to an idle state
* shallower than the one whose bin is fallen into by the sleep length (these
* situations are referred to as "intercepts" below).
*
* In addition to the metrics described above, the governor counts recent
* intercepts (that is, intercepts that have occurred during the last
* %NR_RECENT invocations of it for the given CPU) for each bin.
*
* In order to select an idle state for a CPU, the governor takes the following
* steps (modulo the possible latency constraint that must be taken into account
* too):
* *
* Of course, non-timer wakeup sources are more important in some use cases and * 1. Find the deepest CPU idle state whose target residency does not exceed
* they can be covered by taking a few most recent idle time intervals of the * the current sleep length (the candidate idle state) and compute 3 sums as
* CPU into account. However, even in that case it is not necessary to consider * follows:
* idle duration values greater than the time till the closest timer, as the
* patterns that they may belong to produce average values close enough to
* the time till the closest timer (sleep length) anyway.
* *
* Thus this governor estimates whether or not the upcoming idle time of the CPU * - The sum of the "hits" and "intercepts" metrics for the candidate state
* is likely to be significantly shorter than the sleep length and selects an * and all of the deeper idle states (it represents the cases in which the
* idle state for it in accordance with that, as follows: * CPU was idle long enough to avoid being intercepted if the sleep length
* had been equal to the current one).
* *
* - Find an idle state on the basis of the sleep length and state statistics * - The sum of the "intercepts" metrics for all of the idle states shallower
* collected over time: * than the candidate one (it represents the cases in which the CPU was not
* idle long enough to avoid being intercepted if the sleep length had been
* equal to the current one).
* *
* o Find the deepest idle state whose target residency is less than or equal * - The sum of the numbers of recent intercepts for all of the idle states
* to the sleep length. * shallower than the candidate one.
* *
* o Select it if it matched both the sleep length and the observed idle * 2. If the second sum is greater than the first one or the third sum is
* duration in the past more often than it matched the sleep length alone * greater than %NR_RECENT / 2, the CPU is likely to wake up early, so look
* (i.e. the observed idle duration was significantly shorter than the sleep * for an alternative idle state to select.
* length matched by it).
* *
* o Otherwise, select the shallower state with the greatest matched "early" * - Traverse the idle states shallower than the candidate one in the
* wakeups metric. * descending order.
* *
* - If the majority of the most recent idle duration values are below the * - For each of them compute the sum of the "intercepts" metrics and the sum
* target residency of the idle state selected so far, use those values to * of the numbers of recent intercepts over all of the idle states between
* compute the new expected idle duration and find an idle state matching it * it and the candidate one (including the former and excluding the
* (which has to be shallower than the one selected so far). * latter).
*
* - If each of these sums that needs to be taken into account (because the
* check related to it has indicated that the CPU is likely to wake up
* early) is greater than a half of the corresponding sum computed in step
* 1 (which means that the target residency of the state in question had
* not exceeded the idle duration in over a half of the relevant cases),
* select the given idle state instead of the candidate one.
*
* 3. By default, select the candidate state.
*/ */
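
The following stand-alone sketch is illustrative only (it is not the kernel
implementation; the state residencies, metrics and helper names are made up)
and shows the shape of the selection steps described in the comment above:

	#include <stdio.h>

	#define NR_STATES	4
	#define NR_RECENT	9

	struct bin {
		unsigned int intercepts;
		unsigned int hits;
		unsigned int recent;
	};

	/* Hypothetical target residencies (microseconds) of 4 idle states. */
	static const unsigned int target_residency[NR_STATES] = { 1, 20, 200, 2000 };

	static int select_state(const struct bin *bins, unsigned int sleep_length_us)
	{
		unsigned int hit_sum = 0, intercept_sum = 0, recent_sum = 0;
		int candidate = 0, i;

		/* Step 1: deepest state whose target residency fits the sleep length. */
		for (i = NR_STATES - 1; i >= 0; i--) {
			if (target_residency[i] <= sleep_length_us) {
				candidate = i;
				break;
			}
		}

		for (i = 0; i < NR_STATES; i++) {
			if (i < candidate) {
				intercept_sum += bins[i].intercepts;
				recent_sum += bins[i].recent;
			} else {
				hit_sum += bins[i].hits + bins[i].intercepts;
			}
		}

		/* Step 2: early wakeups dominate, so look for a shallower state. */
		if (intercept_sum > hit_sum || recent_sum > NR_RECENT / 2) {
			unsigned int part_intercepts = 0, part_recent = 0;

			for (i = candidate - 1; i >= 0; i--) {
				part_intercepts += bins[i].intercepts;
				part_recent += bins[i].recent;

				if ((intercept_sum <= hit_sum ||
				     2 * part_intercepts > intercept_sum) &&
				    (recent_sum <= NR_RECENT / 2 ||
				     2 * part_recent > recent_sum))
					return i;
			}
		}

		/* Step 3: fall back to the candidate. */
		return candidate;
	}

	int main(void)
	{
		/* Made-up metrics: most recent wakeups came well before the timer. */
		const struct bin bins[NR_STATES] = {
			{ .intercepts = 12, .hits = 2, .recent = 5 },
			{ .intercepts = 4,  .hits = 3, .recent = 2 },
			{ .intercepts = 1,  .hits = 6, .recent = 0 },
			{ .intercepts = 0,  .hits = 9, .recent = 0 },
		};

		printf("selected state: %d\n", select_state(bins, 1500));
		return 0;
	}
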
#include <linux/cpuidle.h> #include <linux/cpuidle.h>
...@@ -60,65 +116,51 @@ ...@@ -60,65 +116,51 @@
/* /*
* Number of the most recent idle duration values to take into consideration for * Number of the most recent idle duration values to take into consideration for
* the detection of wakeup patterns. * the detection of recent early wakeup patterns.
*/ */
#define INTERVALS 8 #define NR_RECENT 9
/**
 * struct teo_bin - Metrics used by the TEO cpuidle governor.
 * @intercepts: The "intercepts" metric.
 * @hits: The "hits" metric.
 * @recent: The number of recent "intercepts".
 */
struct teo_bin {
	unsigned int intercepts;
	unsigned int hits;
	unsigned int recent;
};
/**
 * struct teo_cpu - CPU data used by the TEO cpuidle governor.
 * @time_span_ns: Time between idle state selection and post-wakeup update.
 * @sleep_length_ns: Time till the closest timer event (at the selection time).
 * @state_bins: Idle state data bins for this CPU.
 * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
 * @next_recent_idx: Index of the next @recent_idx entry to update.
 * @recent_idx: Indices of bins corresponding to recent "intercepts".
 */
struct teo_cpu {
	s64 time_span_ns;
	s64 sleep_length_ns;
	struct teo_bin state_bins[CPUIDLE_STATE_MAX];
	unsigned int total;
	int next_recent_idx;
	int recent_idx[NR_RECENT];
};
static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
/**
 * teo_update - Update CPU metrics after wakeup.
 * @drv: cpuidle driver containing state data.
 * @dev: Target CPU.
 */
static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
	int i, idx_timer = 0, idx_duration = 0;
	u64 measured_ns;

	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
...@@ -151,53 +193,52 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
		measured_ns /= 2;
	}

	cpu_data->total = 0;

	/*
	 * Decay the "hits" and "intercepts" metrics for all of the bins and
	 * find the bins that the sleep length and the measured idle duration
	 * fall into.
	 */
	for (i = 0; i < drv->state_count; i++) {
		s64 target_residency_ns = drv->states[i].target_residency_ns;
		struct teo_bin *bin = &cpu_data->state_bins[i];

		bin->hits -= bin->hits >> DECAY_SHIFT;
		bin->intercepts -= bin->intercepts >> DECAY_SHIFT;

		cpu_data->total += bin->hits + bin->intercepts;

		if (target_residency_ns <= cpu_data->sleep_length_ns) {
			idx_timer = i;
			if (target_residency_ns <= measured_ns)
				idx_duration = i;
		}
	}

	i = cpu_data->next_recent_idx++;
	if (cpu_data->next_recent_idx >= NR_RECENT)
		cpu_data->next_recent_idx = 0;

	if (cpu_data->recent_idx[i] >= 0)
		cpu_data->state_bins[cpu_data->recent_idx[i]].recent--;

	/*
	 * If the measured idle duration falls into the same bin as the sleep
	 * length, this is a "hit", so update the "hits" metric for that bin.
	 * Otherwise, update the "intercepts" metric for the bin fallen into by
	 * the measured idle duration.
	 */
	if (idx_timer == idx_duration) {
		cpu_data->state_bins[idx_timer].hits += PULSE;
		cpu_data->recent_idx[i] = -1;
	} else {
		cpu_data->state_bins[idx_duration].intercepts += PULSE;
		cpu_data->state_bins[idx_duration].recent++;
		cpu_data->recent_idx[i] = idx_duration;
	}

	cpu_data->total += PULSE;
}
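The shift-based decay used above is easier to see in isolation. What follows is a minimal, self-contained userspace sketch of the same bookkeeping, not kernel code; DECAY_SHIFT and PULSE are assumed to keep the values the governor uses at the time of writing (3 and 1024), and every other name is purely illustrative.

#include <stdio.h>

#define DECAY_SHIFT 3		/* metric loses 1/8 of its value on every update */
#define PULSE	    1024	/* weight added to the bin that "won" this wakeup */

int main(void)
{
	unsigned int metric = 0;
	int i;

	/* Feed 40 consecutive events into the same bin. */
	for (i = 0; i < 40; i++) {
		metric -= metric >> DECAY_SHIFT;
		metric += PULSE;
	}
	/* The metric saturates near PULSE * 2^DECAY_SHIFT = 8192. */
	printf("after a long run of events: %u\n", metric);

	/* The CPU behaviour changes and this bin stops "winning". */
	for (i = 0; i < 40; i++)
		metric -= metric >> DECAY_SHIFT;
	printf("after the pattern changes:  %u\n", metric);

	return 0;
}

The point of the scheme is that each bin's metric stays bounded while recent wakeups dominate older ones, so no explicit history array is needed beyond the small recent_idx ring buffer.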
static bool teo_time_ok(u64 interval_ns)
...@@ -205,6 +246,12 @@ static bool teo_time_ok(u64 interval_ns)
	return !tick_nohz_tick_stopped() || interval_ns >= TICK_NSEC;
}
static s64 teo_middle_of_bin(int idx, struct cpuidle_driver *drv)
{
return (drv->states[idx].target_residency_ns +
drv->states[idx+1].target_residency_ns) / 2;
}
/**
 * teo_find_shallower_state - Find shallower idle state matching given duration.
 * @drv: cpuidle driver containing state data.
...@@ -240,10 +287,18 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
{
	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
	unsigned int idx_intercept_sum = 0;
	unsigned int intercept_sum = 0;
	unsigned int idx_recent_sum = 0;
	unsigned int recent_sum = 0;
	unsigned int idx_hit_sum = 0;
	unsigned int hit_sum = 0;
	int constraint_idx = 0;
	int idx0 = 0, idx = -1;
	bool alt_intercepts, alt_recent;
	ktime_t delta_tick;
	s64 duration_ns;
	int i;

	if (dev->last_state_idx >= 0) {
		teo_update(drv, dev);
...@@ -255,170 +310,135 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
	cpu_data->sleep_length_ns = duration_ns;

	/* Check if there is any choice in the first place. */
	if (drv->state_count < 2) {
		idx = 0;
		goto end;
	}
	if (!dev->states_usage[0].disable) {
		idx = 0;
		if (drv->states[1].target_residency_ns > duration_ns)
			goto end;
	}

	/*
	 * Find the deepest idle state whose target residency does not exceed
	 * the current sleep length and the deepest idle state not deeper than
	 * the former whose exit latency does not exceed the current latency
	 * constraint.  Compute the sums of metrics for early wakeup pattern
	 * detection.
	 */
	for (i = 1; i < drv->state_count; i++) {
		struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
		struct cpuidle_state *s = &drv->states[i];

		/*
		 * Update the sums of idle state metrics for all of the states
		 * shallower than the current one.
		 */
		intercept_sum += prev_bin->intercepts;
		hit_sum += prev_bin->hits;
		recent_sum += prev_bin->recent;

		if (dev->states_usage[i].disable)
			continue;

		if (idx < 0) {
			idx = i; /* first enabled state */
			idx0 = i;
		}

		if (s->target_residency_ns > duration_ns)
			break;

		idx = i;

		if (s->exit_latency_ns <= latency_req)
			constraint_idx = i;

		idx_intercept_sum = intercept_sum;
		idx_hit_sum = hit_sum;
		idx_recent_sum = recent_sum;
	}

	/* Avoid unnecessary overhead. */
	if (idx < 0) {
		idx = 0; /* No states enabled, must use 0. */
		goto end;
	} else if (idx == idx0) {
		goto end;
	}

	/*
	 * If the sum of the intercepts metric for all of the idle states
	 * shallower than the current candidate one (idx) is greater than the
	 * sum of the intercepts and hits metrics for the candidate state and
	 * all of the deeper states, or the sum of the numbers of recent
	 * intercepts over all of the states shallower than the candidate one
	 * is greater than a half of the number of recent events taken into
	 * account, the CPU is likely to wake up early, so find an alternative
	 * idle state to select.
	 */
	alt_intercepts = 2 * idx_intercept_sum > cpu_data->total - idx_hit_sum;
	alt_recent = idx_recent_sum > NR_RECENT / 2;
	if (alt_recent || alt_intercepts) {
		s64 last_enabled_span_ns = duration_ns;
		int last_enabled_idx = idx;

		/*
		 * Look for the deepest idle state whose target residency had
		 * not exceeded the idle duration in over a half of the relevant
		 * cases (both with respect to intercepts overall and with
		 * respect to the recent intercepts only) in the past.
		 *
		 * Take the possible latency constraint and duration limitation
		 * present if the tick has been stopped already into account.
		 */
		intercept_sum = 0;
		recent_sum = 0;

		for (i = idx - 1; i >= idx0; i--) {
			struct teo_bin *bin = &cpu_data->state_bins[i];
			s64 span_ns;

			intercept_sum += bin->intercepts;
			recent_sum += bin->recent;

			if (dev->states_usage[i].disable)
				continue;

			span_ns = teo_middle_of_bin(i, drv);
			if (!teo_time_ok(span_ns)) {
				/*
				 * The current state is too shallow, so select
				 * the first enabled deeper state.
				 */
				duration_ns = last_enabled_span_ns;
				idx = last_enabled_idx;
				break;
			}

			if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
			    (!alt_intercepts ||
			     2 * intercept_sum > idx_intercept_sum)) {
				idx = i;
				duration_ns = span_ns;
				break;
			}

			last_enabled_span_ns = span_ns;
			last_enabled_idx = i;
		}
	}

	/*
	 * If there is a latency constraint, it may be necessary to select an
	 * idle state shallower than the current candidate one.
	 */
	if (idx > constraint_idx)
		idx = constraint_idx;

end:
	/*
	 * Don't stop the tick if the selected state is a polling one or if the
	 * expected idle duration is shorter than the tick period length.
...@@ -478,8 +498,8 @@ static int teo_enable_device(struct cpuidle_driver *drv,
	memset(cpu_data, 0, sizeof(*cpu_data));

	for (i = 0; i < NR_RECENT; i++)
		cpu_data->recent_idx[i] = -1;

	return 0;
}
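To complement the algorithm description at the top of the file, here is a compact userspace sketch of the three-step selection rule under simplifying assumptions (no disabled states, no latency constraint, no tick or bin-span limits). The bin layout mirrors struct teo_bin; pick_state() and its parameters are hypothetical names introduced only for this illustration.

#define NR_RECENT 9

struct bin { unsigned int intercepts, hits, recent; };

/*
 * Pick an idle state index given the candidate chosen from the sleep length
 * alone.  'bins' has one entry per idle state and 'total' is the decayed
 * grand total of hits + intercepts over all of the bins.
 */
int pick_state(const struct bin *bins, unsigned int total, int candidate)
{
	unsigned int idx_intercept_sum = 0, idx_hit_sum = 0, idx_recent_sum = 0;
	unsigned int intercept_sum = 0, recent_sum = 0;
	int alt_intercepts, alt_recent, i;

	/* Step 1: sums over the states shallower than the candidate. */
	for (i = 0; i < candidate; i++) {
		idx_intercept_sum += bins[i].intercepts;
		idx_hit_sum += bins[i].hits;
		idx_recent_sum += bins[i].recent;
	}

	/* Step 2: is an early ("intercepted") wakeup likely? */
	alt_intercepts = 2 * idx_intercept_sum > total - idx_hit_sum;
	alt_recent = idx_recent_sum > NR_RECENT / 2;
	if (!alt_intercepts && !alt_recent)
		return candidate;	/* Step 3: default to the candidate. */

	/* Deepest shallower state that covers most of the intercepts. */
	for (i = candidate - 1; i >= 0; i--) {
		intercept_sum += bins[i].intercepts;
		recent_sum += bins[i].recent;

		if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
		    (!alt_intercepts || 2 * intercept_sum > idx_intercept_sum))
			return i;
	}

	return candidate;
}

The real governor additionally honors disabled states, the exit-latency constraint and the tick length, but the decision structure is the same as in this sketch.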
......
...@@ -103,7 +103,6 @@ config ARM_IMX8M_DDRC_DEVFREQ
	tristate "i.MX8M DDRC DEVFREQ Driver"
	depends on (ARCH_MXC && HAVE_ARM_SMCCC) || \
		   (COMPILE_TEST && HAVE_ARM_SMCCC)
	select DEVFREQ_GOV_SIMPLE_ONDEMAND
	select DEVFREQ_GOV_USERSPACE
	help
	  This adds the DEVFREQ driver for the i.MX8M DDR Controller. It allows
......
...@@ -823,6 +823,7 @@ struct devfreq *devfreq_add_device(struct device *dev,
	if (devfreq->profile->timer < 0
		|| devfreq->profile->timer >= DEVFREQ_TIMER_NUM) {
		mutex_unlock(&devfreq->lock);
		err = -EINVAL;
		goto err_dev;
	}
......
...@@ -65,7 +65,7 @@ static int devfreq_passive_get_target_freq(struct devfreq *devfreq,
	dev_pm_opp_put(p_opp);

	if (IS_ERR(opp))
		goto no_required_opp;

	*freq = dev_pm_opp_get_freq(opp);
	dev_pm_opp_put(opp);
...@@ -73,6 +73,7 @@ static int devfreq_passive_get_target_freq(struct devfreq *devfreq,
		return 0;
	}

no_required_opp:
	/*
	 * Get the OPP table's index of decided frequency by governor
	 * of parent device.
......
...@@ -31,7 +31,7 @@ static int devfreq_userspace_func(struct devfreq *df, unsigned long *freq)
	return 0;
}

static ssize_t set_freq_store(struct device *dev, struct device_attribute *attr,
			      const char *buf, size_t count)
{
	struct devfreq *devfreq = to_devfreq(dev);
...@@ -52,8 +52,8 @@ static ssize_t store_freq(struct device *dev, struct device_attribute *attr,
	return err;
}

static ssize_t set_freq_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	struct devfreq *devfreq = to_devfreq(dev);
	struct userspace_data *data;
...@@ -70,7 +70,7 @@ static ssize_t show_freq(struct device *dev, struct device_attribute *attr,
	return err;
}

static DEVICE_ATTR_RW(set_freq);

static struct attribute *dev_entries[] = {
	&dev_attr_set_freq.attr,
	NULL,
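The rename above is needed because DEVICE_ATTR_RW() derives the callback names from the attribute name. A minimal sketch of the expected pattern, using a hypothetical "foo" attribute rather than anything from this patch:

#include <linux/device.h>
#include <linux/sysfs.h>

/* DEVICE_ATTR_RW(foo) expects foo_show() and foo_store() and uses mode 0644. */
static ssize_t foo_show(struct device *dev, struct device_attribute *attr,
			char *buf)
{
	return sysfs_emit(buf, "%d\n", 42);	/* report some driver state */
}

static ssize_t foo_store(struct device *dev, struct device_attribute *attr,
			 const char *buf, size_t count)
{
	/* parse 'buf', update driver state, then return 'count' on success */
	return count;
}

static DEVICE_ATTR_RW(foo);	/* same as DEVICE_ATTR(foo, 0644, foo_show, foo_store) */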
......
...@@ -45,18 +45,6 @@ static int imx_bus_get_cur_freq(struct device *dev, unsigned long *freq)
	return 0;
}
static int imx_bus_get_dev_status(struct device *dev,
struct devfreq_dev_status *stat)
{
struct imx_bus *priv = dev_get_drvdata(dev);
stat->busy_time = 0;
stat->total_time = 0;
stat->current_frequency = clk_get_rate(priv->clk);
return 0;
}
static void imx_bus_exit(struct device *dev)
{
	struct imx_bus *priv = dev_get_drvdata(dev);
...@@ -129,9 +117,7 @@ static int imx_bus_probe(struct platform_device *pdev)
		return ret;
	}

	priv->profile.polling_ms = 1000;
	priv->profile.target = imx_bus_target;
	priv->profile.get_dev_status = imx_bus_get_dev_status;
	priv->profile.exit = imx_bus_exit;
	priv->profile.get_cur_freq = imx_bus_get_cur_freq;
	priv->profile.initial_freq = clk_get_rate(priv->clk);
......
...@@ -688,6 +688,7 @@ static struct devfreq_dev_profile tegra_devfreq_profile = {
	.polling_ms	= ACTMON_SAMPLING_PERIOD,
	.target		= tegra_devfreq_target,
	.get_dev_status	= tegra_devfreq_get_dev_status,
	.is_cooling_device = true,
};

static int tegra_governor_get_target(struct devfreq *devfreq,
......
...@@ -1484,6 +1484,36 @@ static void __init sklh_idle_state_table_update(void)
	skl_cstates[6].flags |= CPUIDLE_FLAG_UNUSABLE;	/* C9-SKL */
}
/**
* skx_idle_state_table_update - Adjust the Sky Lake/Cascade Lake
* idle states table.
*/
static void __init skx_idle_state_table_update(void)
{
unsigned long long msr;
rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
/*
* 000b: C0/C1 (no package C-state support)
* 001b: C2
* 010b: C6 (non-retention)
* 011b: C6 (retention)
* 111b: No Package C state limits.
*/
if ((msr & 0x7) < 2) {
/*
* Use the CC6 + PC0 latency, and 3 times
* that latency for target_residency, if PC6
* is disabled in the BIOS. This is consistent
* with how the intel_idle driver uses _CST
* to set target_residency.
*/
skx_cstates[2].exit_latency = 92;
skx_cstates[2].target_residency = 276;
}
}
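The 3-bit limit field tested above can be decoded with a small lookup table. The following userspace-style sketch only restates the encoding listed in the comment; the MSR value is a stand-in rather than a real read, and all names are illustrative.

#include <stdio.h>

/* Package C-state limit encodings from MSR_PKG_CST_CONFIG_CONTROL[2:0] (SKX). */
static const char * const pkg_cst_limit[8] = {
	[0] = "C0/C1 (no package C-states)",
	[1] = "C2",
	[2] = "C6 (non-retention)",
	[3] = "C6 (retention)",
	[7] = "no package C-state limit",
};

int main(void)
{
	unsigned long long msr = 0x1;		/* pretend value read from the MSR */
	unsigned int limit = msr & 0x7;

	printf("package C-state limit: %s\n",
	       pkg_cst_limit[limit] ? pkg_cst_limit[limit] : "reserved");
	/* PC6 is unavailable whenever the limit field is below 2. */
	printf("PC6 available: %s\n", limit >= 2 ? "yes" : "no");

	return 0;
}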
static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
{
	unsigned int mwait_cstate = MWAIT_HINT2CSTATE(mwait_hint) + 1;
...@@ -1515,6 +1545,9 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
	case INTEL_FAM6_SKYLAKE:
		sklh_idle_state_table_update();
		break;
	case INTEL_FAM6_SKYLAKE_X:
		skx_idle_state_table_update();
		break;
	}

	for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
......
...@@ -893,6 +893,16 @@ static int _set_required_opps(struct device *dev,
	if (!required_opp_tables)
		return 0;
/*
* We only support genpd's OPPs in the "required-opps" for now, as we
* don't know much about other use cases. Error out if the required OPP
* doesn't belong to a genpd.
*/
if (unlikely(!required_opp_tables[0]->is_genpd)) {
dev_err(dev, "required-opps don't belong to a genpd\n");
return -ENOENT;
}
	/* required-opps not fully initialized yet */
	if (lazy_linking_pending(opp_table))
		return -EBUSY;
......
...@@ -197,21 +197,8 @@ static void _opp_table_alloc_required_tables(struct opp_table *opp_table,
		required_opp_tables[i] = _find_table_of_opp_np(required_np);
		of_node_put(required_np);

		if (IS_ERR(required_opp_tables[i]))
			lazy = true;
/*
* We only support genpd's OPPs in the "required-opps" for now,
* as we don't know how much about other cases. Error out if the
* required OPP doesn't belong to a genpd.
*/
if (!required_opp_tables[i]->is_genpd) {
dev_err(dev, "required-opp doesn't belong to genpd: %pOF\n",
required_np);
goto free_required_tables;
}
	}

	/* Let's do the linking later on */
...@@ -379,13 +366,6 @@ static void lazy_link_required_opp_table(struct opp_table *new_table)
	struct dev_pm_opp *opp;
	int i, ret;
/*
* We only support genpd's OPPs in the "required-opps" for now,
* as we don't know much about other cases.
*/
if (!new_table->is_genpd)
return;
	mutex_lock(&opp_table_lock);

	list_for_each_entry_safe(opp_table, temp, &lazy_opp_tables, lazy) {
...@@ -433,8 +413,7 @@ static void lazy_link_required_opp_table(struct opp_table *new_table)
		/* All required opp-tables found, remove from lazy list */
		if (!lazy) {
			list_del_init(&opp_table->lazy);

			list_for_each_entry(opp, &opp_table->opp_list, node)
				_required_opps_available(opp, opp_table->required_opp_count);
...@@ -874,7 +853,7 @@ static struct dev_pm_opp *_opp_add_static_v2(struct opp_table *opp_table,
		return ERR_PTR(-ENOMEM);

	ret = _read_opp_key(new_opp, opp_table, np, &rate_not_available);
	if (ret < 0) {
		dev_err(dev, "%s: opp key field not found\n", __func__);
		goto free_opp;
	}
......
...@@ -198,6 +198,7 @@ struct generic_pm_domain_data {
	struct notifier_block *power_nb;
	int cpu;
	unsigned int performance_state;
	unsigned int rpm_pstate;
	ktime_t next_wakeup;
	void *data;
};
......
...@@ -380,6 +380,9 @@ static inline int pm_runtime_get(struct device *dev)
 * The possible return values of this function are the same as for
 * pm_runtime_resume() and the runtime PM usage counter of @dev remains
 * incremented in all cases, even if it returns an error code.
 * Consider using pm_runtime_resume_and_get() instead of it, especially
 * if its return value is checked by the caller, as this is likely to result
 * in cleaner code.
 */
static inline int pm_runtime_get_sync(struct device *dev)
{
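A short sketch of why the documentation above points at pm_runtime_resume_and_get(): with pm_runtime_get_sync() the usage counter stays elevated even on failure, so the error path must drop it explicitly, while the newer helper does that for the caller. foo_do_io() is a hypothetical driver function, not part of this patch.

#include <linux/pm_runtime.h>

static int foo_do_io(struct device *dev)
{
	int ret;

	/* Preferred: the helper drops the usage count itself on failure. */
	ret = pm_runtime_resume_and_get(dev);
	if (ret < 0)
		return ret;

	/* ... access the hardware while it is guaranteed to be powered ... */

	pm_runtime_put(dev);
	return 0;
}

/*
 * Equivalent open-coded form with pm_runtime_get_sync(); note the extra
 * pm_runtime_put_noidle() that is easy to forget:
 *
 *	ret = pm_runtime_get_sync(dev);
 *	if (ret < 0) {
 *		pm_runtime_put_noidle(dev);
 *		return ret;
 *	}
 */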
......
...@@ -331,7 +331,7 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
 *
 * Memory bitmap is a structure consisting of many linked lists of
 * objects. The main list's elements are of type struct zone_bitmap
 * and each of them corresponds to one zone. For each zone bitmap
 * object there is a list of objects of type struct bm_block that
 * represent each blocks of bitmap in which information is stored.
 *
...@@ -1500,7 +1500,7 @@ static struct memory_bitmap copy_bm;
/**
 * swsusp_free - Free pages allocated for hibernation image.
 *
 * Image pages are allocated before snapshot creation, so they need to be
 * released after resume.
 */
void swsusp_free(void)
...@@ -2326,7 +2326,7 @@ static struct memory_bitmap *safe_highmem_bm;
 * (@nr_highmem_p points to the variable containing the number of highmem image
 * pages). The pages that are "safe" (ie. will not be overwritten when the
 * hibernation image is restored entirely) have the corresponding bits set in
 * @bm (it must be uninitialized).
 *
 * NOTE: This function should not be called if there are no highmem image pages.
 */
...@@ -2483,7 +2483,7 @@ static inline void free_highmem_data(void) {}
/**
 * prepare_image - Make room for loading hibernation image.
 * @new_bm: Uninitialized memory bitmap structure.
 * @bm: Memory bitmap with unsafe pages marked.
 *
 * Use @bm to mark the pages that will be overwritten in the process of
......
...@@ -1125,7 +1125,7 @@ struct dec_data {
};

/**
 * Decompression function that runs in its own thread.
 */
static int lzo_decompress_threadfn(void *data)
{
......