Commit 08a2e45a authored by Rafael J. Wysocki

Merge branches 'pm-cpuidle' and 'powercap'

* pm-cpuidle:
  ACPI / processor: Set P_LVL{2,3} idle state descriptions
  intel_idle: add support for Jacobsville
  cpuidle: dt: bail out if the idle-state DT node is not compatible
  cpuidle: use BIT() for idle state flags and remove CPUIDLE_DRIVER_FLAGS_MASK
  Documentation: driver-api: PM: Add cpuidle document
  cpuidle: New timer events oriented governor for tickless systems

* powercap:
  powercap/intel_rapl: add Ice Lake mobile
  powercap: intel_rapl: add support for Jacobsville
......@@ -155,14 +155,14 @@ governor uses that information depends on what algorithm is implemented by it
and that is the primary reason for having more than one governor in the
``CPUIdle`` subsystem.
There are two ``CPUIdle`` governors available, ``menu`` and ``ladder``. Which
of them is used depends on the configuration of the kernel and in particular on
whether or not the scheduler tick can be `stopped by the idle
loop <idle-cpus-and-tick_>`_. It is possible to change the governor at run time
if the ``cpuidle_sysfs_switch`` command line parameter has been passed to the
kernel, but that is not safe in general, so it should not be done on production
systems (that may change in the future, though). The name of the ``CPUIdle``
governor currently used by the kernel can be read from the
There are three ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_
and ``ladder``. Which of them is used by default depends on the configuration
of the kernel and in particular on whether or not the scheduler tick can be
`stopped by the idle loop <idle-cpus-and-tick_>`_. It is possible to change the
governor at run time if the ``cpuidle_sysfs_switch`` command line parameter has
been passed to the kernel, but that is not safe in general, so it should not be
done on production systems (that may change in the future, though). The name of
the ``CPUIdle`` governor currently used by the kernel can be read from the
:file:`current_governor_ro` (or :file:`current_governor` if
``cpuidle_sysfs_switch`` is present in the kernel command line) file under
:file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.
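As an illustration of the paragraph above, the following userspace sketch reads
the governor name from ``sysfs``. The helper names ``read_first_line()`` and
``read_current_governor()`` are hypothetical and not part of any kernel or libc
API; only the directory and the two file names come from the text. The sketch
assumes that at most one of the two files is readable at a time and simply
tries the read-only one first.

```c
#include <stdio.h>
#include <string.h>

/* Directory named in the documentation text above. */
#define CPUIDLE_SYSFS_DIR "/sys/devices/system/cpu/cpuidle/"

/*
 * Hypothetical helper: copy the first line of the file at @path into @buf
 * (without the trailing newline).  Returns 0 on success, -1 on failure.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Hypothetical helper: try current_governor_ro first and fall back to
 * current_governor (present when cpuidle_sysfs_switch was passed).
 */
int read_current_governor(char *buf, size_t len)
{
	if (read_first_line(CPUIDLE_SYSFS_DIR "current_governor_ro", buf, len) == 0)
		return 0;
	return read_first_line(CPUIDLE_SYSFS_DIR "current_governor", buf, len);
}
```

A caller would pass a small buffer (for example 64 bytes) and print the result;
on a system without ``CPUIdle`` support both reads fail and -1 is returned.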
......@@ -256,6 +256,8 @@ the ``menu`` governor by default and if it is not tickless, the default
``CPUIdle`` governor on it will be ``ladder``.
.. _menu-gov:
The ``menu`` Governor
=====================
......@@ -333,6 +335,92 @@ that time, the governor may need to select a shallower state with a suitable
target residency.
.. _teo-gov:
The Timer Events Oriented (TEO) Governor
========================================
The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
for tickless systems. It follows the same basic strategy as the ``menu`` `one
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
given conditions. However, it applies a different approach to that problem.
First, it does not use sleep length correction factors, but instead it attempts
to correlate the observed idle duration values with the available idle states
and use that information to pick the idle state that is most likely to
"match" the upcoming CPU idle interval. Second, it does not take into account
at all the tasks that were running on the given CPU in the past and are now
waiting on some I/O operations to complete (there is no guarantee that they
will run on the same CPU when they become runnable again), and the pattern
detection code in it avoids taking timer wakeups into account. It also only
uses idle duration values less than the current time till the closest timer
(with the scheduler tick excluded) for that purpose.
Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
the *sleep length*, which is the time until the closest timer event with the
assumption that the scheduler tick will be stopped (that also is the upper bound
on the time until the next CPU wakeup). That value is then used to preselect an
idle state on the basis of three metrics maintained for each idle state provided
by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
state will "match" the observed (post-wakeup) idle duration if it "matches" the
sleep length. They both are subject to decay (after a CPU wakeup) every time
the target residency of the idle state corresponding to them is less than or
equal to the sleep length and the target residency of the next idle state is
greater than the sleep length (that is, when the idle state corresponding to
them "matches" the sleep length). The ``hits`` metric is increased if the
former condition is satisfied and the target residency of the given idle state
is less than or equal to the observed idle duration and the target residency of
the next idle state is greater than the observed idle duration at the same time
(that is, it is increased when the given idle state "matches" both the sleep
length and the observed idle duration). In turn, the ``misses`` metric is
increased when the given idle state "matches" the sleep length only and the
observed idle duration is too short for its target residency.
The ``early_hits`` metric measures the likelihood that a given idle state will
"match" the observed (post-wakeup) idle duration if it does not "match" the
sleep length. It is subject to decay on every CPU wakeup and it is increased
when the idle state corresponding to it "matches" the observed (post-wakeup)
idle duration and the target residency of the next idle state is less than or
equal to the sleep length (i.e. the idle state "matching" the sleep length is
deeper than the given one).
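The bookkeeping described in the two paragraphs above can be sketched in
userspace C as follows. This is a simplified model, not the code from the
governor itself: the names ``teo_sketch_metrics``, ``deepest_fitting_state()``
and ``teo_sketch_update()`` are hypothetical, and the ``PULSE`` and
``DECAY_SHIFT`` values are assumptions chosen for illustration. Target
residencies are assumed to be sorted in ascending order.

```c
#define PULSE       1024u /* assumed increment applied on a wakeup */
#define DECAY_SHIFT 3     /* assumed decay: drop 1/8 of the value */

struct teo_sketch_metrics {
	unsigned int hits;
	unsigned int misses;
	unsigned int early_hits;
};

/* Deepest state with target residency <= @duration_us, or -1 if none. */
static int deepest_fitting_state(const unsigned int *residency_us, int count,
				 unsigned int duration_us)
{
	int i, idx = -1;

	for (i = 0; i < count; i++)
		if (residency_us[i] <= duration_us)
			idx = i;
	return idx;
}

/* Update the per-state metrics after a CPU wakeup. */
void teo_sketch_update(struct teo_sketch_metrics *m,
		       const unsigned int *residency_us, int count,
		       unsigned int sleep_length_us, unsigned int measured_us)
{
	int i, idx_timer, idx_hit;

	/* The sleep length is the upper bound on the idle duration. */
	if (measured_us > sleep_length_us)
		measured_us = sleep_length_us;

	/* State "matching" the sleep length and the observed duration. */
	idx_timer = deepest_fitting_state(residency_us, count, sleep_length_us);
	idx_hit = deepest_fitting_state(residency_us, count, measured_us);

	/* early_hits decays on every wakeup, for every state. */
	for (i = 0; i < count; i++)
		m[i].early_hits -= m[i].early_hits >> DECAY_SHIFT;

	if (idx_timer < 0)
		return;

	/* hits and misses decay only for the state matching the sleep length. */
	m[idx_timer].hits -= m[idx_timer].hits >> DECAY_SHIFT;
	m[idx_timer].misses -= m[idx_timer].misses >> DECAY_SHIFT;

	if (idx_hit == idx_timer) {
		m[idx_timer].hits += PULSE;
	} else {
		/* Woken up too early for the state matching the sleep length. */
		m[idx_timer].misses += PULSE;
		if (idx_hit >= 0)
			m[idx_hit].early_hits += PULSE;
	}
}
```

With target residencies of 0, 100 and 1000 microseconds, a sleep length of 500
and an observed idle duration of 50, the second state records a miss and the
first state records an early hit.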
The governor walks the list of idle states provided by the ``CPUIdle`` driver
and finds the last (deepest) one with the target residency less than or equal
to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
state are compared with each other and it is preselected if the ``hits`` one is
greater (which means that that idle state is likely to "match" the observed idle
duration after CPU wakeup). If the ``misses`` one is greater, the governor
preselects the shallower idle state with the maximum ``early_hits`` metric
(or if there are multiple shallower idle states with equal ``early_hits``
metric which also is the maximum, the shallowest of them will be preselected).
[If there is a wakeup latency constraint coming from the `PM QoS framework
<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
target residency within the sleep length, the deepest idle state with the exit
latency within the constraint is preselected without consulting the ``hits``,
``misses`` and ``early_hits`` metrics.]
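The preselection step described above can be modeled with a short userspace
sketch. It is an illustration under stated assumptions rather than the
governor's code: the function name ``teo_sketch_preselect()`` is hypothetical,
the PM QoS latency constraint is left out, the shallowest state is returned
when no state fits the sleep length, and a tie between ``hits`` and ``misses``
keeps the deeper state (the text does not say what happens in that case).

```c
/*
 * Preselect an idle state index.  All arrays have @count entries and
 * @residency_us is assumed to be sorted in ascending order.
 */
int teo_sketch_preselect(const unsigned int *residency_us,
			 const unsigned int *hits,
			 const unsigned int *misses,
			 const unsigned int *early_hits,
			 int count, unsigned int sleep_length_us)
{
	int i, idx = 0, best = 0;

	/* Deepest state with target residency within the sleep length. */
	for (i = 1; i < count; i++)
		if (residency_us[i] <= sleep_length_us)
			idx = i;

	/* Likely to "match" the observed idle duration: keep it. */
	if (idx == 0 || hits[idx] >= misses[idx])
		return idx;

	/*
	 * Otherwise take the shallower state with the maximum early_hits.
	 * The strict comparison keeps the shallowest state on ties.
	 */
	for (i = 1; i < idx; i++)
		if (early_hits[i] > early_hits[best])
			best = i;

	return best;
}
```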
Next, the governor takes several idle duration values observed most recently
into consideration and if at least half of them are greater than or equal to
the target residency of the preselected idle state, that idle state becomes the
final candidate to ask for. Otherwise, the average of the most recent idle
duration values below the target residency of the preselected idle state is
computed and the governor walks the idle states shallower than the preselected
one and finds the deepest of them with the target residency within that average.
That idle state is then taken as the final candidate to ask for.
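The check against recent idle durations can be sketched as follows. The
function name ``teo_sketch_refine()`` is hypothetical, the number of recent
samples is left to the caller because the text only says "several", and the
shallowest state is returned when no shallower state fits the average (an
assumption made for this illustration).

```c
/*
 * Refine the preselected state using the most recent idle durations.
 * @residency_us is assumed to be sorted in ascending order.
 */
int teo_sketch_refine(const unsigned int *residency_us, int preselected,
		      const unsigned int *recent_us, int n_recent)
{
	unsigned long long sum = 0;
	unsigned int avg;
	int i, below = 0, idx = 0;

	/* Count and sum the samples below the preselected target residency. */
	for (i = 0; i < n_recent; i++) {
		if (recent_us[i] < residency_us[preselected]) {
			below++;
			sum += recent_us[i];
		}
	}

	/* At least half of the samples are long enough: keep the state. */
	if (2 * (n_recent - below) >= n_recent)
		return preselected;

	/* below > n_recent / 2 here, so it is at least 1. */
	avg = (unsigned int)(sum / below);

	/* Deepest shallower state with target residency within the average. */
	for (i = 0; i < preselected; i++)
		if (residency_us[i] <= avg)
			idx = i;

	return idx;
}
```

For example, with target residencies of 0, 100 and 1000 microseconds and recent
durations of 50, 150, 250 and 2000, three of the four samples are below 1000,
their average is 150, and the state with the target residency of 100 is chosen.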
Still, at this point the governor may need to refine the idle state selection if
it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
generally happens if the target residency of the idle state selected so far is
less than the tick period and the tick has not been stopped already (in a
previous iteration of the idle loop). Then, like in the ``menu`` governor
`case <menu-gov_>`_, the sleep length used in the previous computations may not
reflect the real time until the closest timer event and if it really is greater
than that time, a shallower state with a suitable target residency may need to
be selected.
.. _idle-states-representation:
Representation of Idle States
......
Supporting multiple CPU idle levels in kernel
cpuidle drivers
A cpuidle driver hooks into the cpuidle infrastructure and handles the
architecture/platform dependent part of CPU idle states. The driver
provides the platform idle state detection capability and also
has mechanisms in place to support actual entry into and exit from CPU idle
states.
A cpuidle driver initializes the cpuidle_device structure for each CPU device
and registers it with cpuidle using cpuidle_register_device.
If all the idle states are the same, the wrapper function cpuidle_register
can be used instead.
It can also support dynamic changes (like battery <-> AC) by using
cpuidle_pause_and_lock, cpuidle_disable_device, cpuidle_enable_device and
cpuidle_resume_and_unlock.
Interfaces:
extern int cpuidle_register(struct cpuidle_driver *drv,
const struct cpumask *const coupled_cpus);
extern int cpuidle_unregister(struct cpuidle_driver *drv);
extern int cpuidle_register_driver(struct cpuidle_driver *drv);
extern void cpuidle_unregister_driver(struct cpuidle_driver *drv);
extern int cpuidle_register_device(struct cpuidle_device *dev);
extern void cpuidle_unregister_device(struct cpuidle_device *dev);
extern void cpuidle_pause_and_lock(void);
extern void cpuidle_resume_and_unlock(void);
extern int cpuidle_enable_device(struct cpuidle_device *dev);
extern void cpuidle_disable_device(struct cpuidle_device *dev);
Supporting multiple CPU idle levels in kernel
cpuidle governors
A cpuidle governor is a policy routine that decides what idle state to enter at
any given time. The cpuidle core uses different callbacks to the governor.
* enable() to enable the governor for a particular device
* disable() to disable the governor for a particular device
* select() to select an idle state to enter
* reflect() called after returning from the idle state, which can be used
by the governor for some record keeping.
More than one governor can be registered at the same time and
users can switch between governors using the sysfs interface (when enabled).
Support for more than one governor lets developers easily experiment
with different governors. By default, the most suitable governor for your
kernel configuration and platform will be selected by cpuidle.
Interfaces:
extern int cpuidle_register_governor(struct cpuidle_governor *gov);
struct cpuidle_governor
This diff is collapsed.
=======================
Device Power Management
=======================
===============================
CPU and Device Power Management
===============================
.. toctree::
cpuidle
devices
notifiers
types
......
......@@ -4021,6 +4021,7 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git
B: https://bugzilla.kernel.org
F: Documentation/admin-guide/pm/cpuidle.rst
F: Documentation/driver-api/pm/cpuidle.rst
F: drivers/cpuidle/*
F: include/linux/cpuidle.h
......
......@@ -282,6 +282,13 @@ static int acpi_processor_get_power_info_fadt(struct acpi_processor *pr)
pr->power.states[ACPI_STATE_C2].address,
pr->power.states[ACPI_STATE_C3].address));
snprintf(pr->power.states[ACPI_STATE_C2].desc,
ACPI_CX_DESC_LEN, "ACPI P_LVL2 IOPORT 0x%x",
pr->power.states[ACPI_STATE_C2].address);
snprintf(pr->power.states[ACPI_STATE_C3].desc,
ACPI_CX_DESC_LEN, "ACPI P_LVL3 IOPORT 0x%x",
pr->power.states[ACPI_STATE_C3].address);
return 0;
}
......
......@@ -4,7 +4,7 @@ config CPU_IDLE
bool "CPU idle PM support"
default y if ACPI || PPC_PSERIES
select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO
help
CPU idle is a generic framework for supporting software-controlled
idle processor power management. It includes modular cross-platform
......@@ -23,6 +23,15 @@ config CPU_IDLE_GOV_LADDER
config CPU_IDLE_GOV_MENU
bool "Menu governor (for tickless system)"
config CPU_IDLE_GOV_TEO
bool "Timer events oriented (TEO) governor (for tickless systems)"
help
This governor implements a simplified idle state selection method
focused on timer events and does not do any interactivity boosting.
Some workloads benefit from using it and it generally should be safe
to use. Say Y here if you are not happy with the alternatives.
config DT_IDLE_STATES
bool
......
......@@ -22,16 +22,12 @@
#include "dt_idle_states.h"
static int init_state_node(struct cpuidle_state *idle_state,
const struct of_device_id *matches,
const struct of_device_id *match_id,
struct device_node *state_node)
{
int err;
const struct of_device_id *match_id;
const char *desc;
match_id = of_match_node(matches, state_node);
if (!match_id)
return -ENODEV;
/*
* CPUidle drivers are expected to initialize the const void *data
* pointer of the passed in struct of_device_id array to the idle
......@@ -160,6 +156,7 @@ int dt_init_idle_driver(struct cpuidle_driver *drv,
{
struct cpuidle_state *idle_state;
struct device_node *state_node, *cpu_node;
const struct of_device_id *match_id;
int i, err = 0;
const cpumask_t *cpumask;
unsigned int state_idx = start_idx;
......@@ -180,6 +177,12 @@ int dt_init_idle_driver(struct cpuidle_driver *drv,
if (!state_node)
break;
match_id = of_match_node(matches, state_node);
if (!match_id) {
err = -ENODEV;
break;
}
if (!of_device_is_available(state_node)) {
of_node_put(state_node);
continue;
......@@ -198,7 +201,7 @@ int dt_init_idle_driver(struct cpuidle_driver *drv,
}
idle_state = &drv->states[state_idx++];
err = init_state_node(idle_state, matches, state_node);
err = init_state_node(idle_state, match_id, state_node);
if (err) {
pr_err("Parsing idle state node %pOF failed with err %d\n",
state_node, err);
......
......@@ -4,3 +4,4 @@
obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o
This diff is collapsed.
......@@ -1103,6 +1103,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
INTEL_CPU_FAM6(ATOM_GOLDMONT, idle_cpu_bxt),
INTEL_CPU_FAM6(ATOM_GOLDMONT_PLUS, idle_cpu_bxt),
INTEL_CPU_FAM6(ATOM_GOLDMONT_X, idle_cpu_dnv),
INTEL_CPU_FAM6(ATOM_TREMONT_X, idle_cpu_dnv),
{}
};
......
......@@ -1156,6 +1156,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = {
INTEL_CPU_FAM6(KABYLAKE_MOBILE, rapl_defaults_core),
INTEL_CPU_FAM6(KABYLAKE_DESKTOP, rapl_defaults_core),
INTEL_CPU_FAM6(CANNONLAKE_MOBILE, rapl_defaults_core),
INTEL_CPU_FAM6(ICELAKE_MOBILE, rapl_defaults_core),
INTEL_CPU_FAM6(ATOM_SILVERMONT, rapl_defaults_byt),
INTEL_CPU_FAM6(ATOM_AIRMONT, rapl_defaults_cht),
......@@ -1164,6 +1165,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = {
INTEL_CPU_FAM6(ATOM_GOLDMONT, rapl_defaults_core),
INTEL_CPU_FAM6(ATOM_GOLDMONT_PLUS, rapl_defaults_core),
INTEL_CPU_FAM6(ATOM_GOLDMONT_X, rapl_defaults_core),
INTEL_CPU_FAM6(ATOM_TREMONT_X, rapl_defaults_core),
INTEL_CPU_FAM6(XEON_PHI_KNL, rapl_defaults_hsw_server),
INTEL_CPU_FAM6(XEON_PHI_KNM, rapl_defaults_hsw_server),
......
......@@ -69,11 +69,9 @@ struct cpuidle_state {
/* Idle State Flags */
#define CPUIDLE_FLAG_NONE (0x00)
#define CPUIDLE_FLAG_POLLING (0x01) /* polling state */
#define CPUIDLE_FLAG_COUPLED (0x02) /* state applies to multiple cpus */
#define CPUIDLE_FLAG_TIMER_STOP (0x04) /* timer is stopped on this state */
#define CPUIDLE_DRIVER_FLAGS_MASK (0xFFFF0000)
#define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */
#define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */
#define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
struct cpuidle_device_kobj;
struct cpuidle_state_kobj;
......