Commit 8ca18367 authored by Frederic Weisbecker, committed by Thomas Gleixner

timer/migration: Fix quick check reporting late expiry

When a CPU is the last active one in the hierarchy and it tries to enter
idle, the quick check that looks up the next event for the cpuidle
heuristics may report an expiry that is too late, such as in the
following scenario:

                        [GRP1:0]
                     migrator = NONE
                     active   = NONE
                     nextevt  = T0:0, T0:1
                     /              \
          [GRP0:0]                  [GRP0:1]
       migrator = NONE           migrator = NONE
       active   = NONE           active   = NONE
       nextevt  = T0, T1         nextevt  = T2
       /         \                /         \
      0           1              2           3
    idle       idle           idle         idle

0) The whole system is idle, and CPU 0 was the last migrator. CPU 0 has
a timer (T0), CPU 1 has a timer (T1) and CPU 2 has a timer (T2). The
expire order is T0 < T1 < T2.

                        [GRP1:0]
                     migrator = GRP0:0
                     active   = GRP0:0
                     nextevt  = T0:0(i), T0:1
                     /              \
          [GRP0:0]                  [GRP0:1]
       migrator = CPU0           migrator = NONE
       active   = CPU0           active   = NONE
       nextevt  = T0(i), T1      nextevt  = T2
       /         \                /         \
      0           1              2           3
    active       idle           idle         idle

1) CPU 0 becomes active. The (i) means a now ignored timer.

                        [GRP1:0]
                     migrator = GRP0:0
                     active   = GRP0:0
                     nextevt  = T0:1
                     /              \
          [GRP0:0]                  [GRP0:1]
       migrator = CPU0           migrator = NONE
       active   = CPU0           active   = NONE
       nextevt  = T1             nextevt  = T2
       /         \                /         \
      0           1              2           3
    active       idle           idle         idle

2) CPU 0 handles remote timers. No timer actually expired, but ignored
   timers have been cleaned out and their siblings' timers haven't been
   propagated. As a result the top level's next event is T2 and not T1.

3) CPU 0 tries to enter idle without any global timer enqueued and calls
   tmigr_quick_check(). The expiry of T2 is returned instead of the
   expiry of T1.
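
The state reached in step 3 can be modeled with a minimal sketch. The
struct model_group type, its lonely flag and the model_* helpers below are
hypothetical stand-ins invented for illustration, not the kernel's data
structures. The walk mirrors the pre-fix logic, which only consults the
top-level group's expiry:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for KTIME_MAX (assumption: modeled as the largest s64 value). */
#define MODEL_KTIME_MAX ((uint64_t)INT64_MAX)

/* Hypothetical, simplified group: not the kernel's struct tmigr_group. */
struct model_group {
	const struct model_group *parent;
	uint64_t next_expiry;	/* earliest event queued in this group */
	int lonely;		/* 1 if only a single child is active */
};

/* Pre-fix behavior: only the top-level expiry is compared to nextevt. */
static uint64_t model_quick_check_old(const struct model_group *group,
				      uint64_t nextevt)
{
	assert(group != NULL);

	do {
		if (!group->lonely)
			return MODEL_KTIME_MAX;
		if (!group->parent)
			return nextevt < group->next_expiry ?
			       nextevt : group->next_expiry;
		group = group->parent;
	} while (group);

	return MODEL_KTIME_MAX;
}

/*
 * State after step 2: GRP0:0 holds T1, but GRP1:0 only holds T2 because
 * T1 was never propagated upwards. CPU 0 has no timer of its own.
 */
static uint64_t model_scenario_old(uint64_t t1, uint64_t t2)
{
	struct model_group top = { NULL, t2, 1 };
	struct model_group grp0_0 = { &top, t1, 1 };

	return model_quick_check_old(&grp0_0, MODEL_KTIME_MAX);
}
```

With T1 < T2, model_scenario_old() returns T2 even though T1 is the
earlier event, which is the late expiry described above.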

When the quick check returns an expiry that is too late, the cpuidle
governor may pick a C-state that is too deep. This may result in
undesired CPU wake-up latency if the next timer is actually close enough.
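
To illustrate why a late expiry matters, here is a toy sketch of a
residency-based state choice. The state table, its numbers and
toy_pick_state() are invented for illustration and do not reflect any real
cpuidle governor or hardware:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle state description; all values are made up. */
struct toy_cstate {
	const char *name;
	uint64_t target_residency_ns;	/* minimum sleep for the state to pay off */
	uint64_t exit_latency_ns;	/* cost of waking up from the state */
};

static const struct toy_cstate toy_states[] = {
	{ "C1",   2000,   1000 },
	{ "C3",  50000,  20000 },
	{ "C6", 600000, 200000 },
};

/*
 * Return the index of the deepest state whose target residency fits in the
 * predicted idle duration. Falls back to the shallowest state (index 0).
 */
static size_t toy_pick_state(uint64_t predicted_idle_ns)
{
	size_t i, best = 0;

	for (i = 0; i < sizeof(toy_states) / sizeof(toy_states[0]); i++) {
		if (toy_states[i].target_residency_ns <= predicted_idle_ns)
			best = i;
	}

	return best;
}
```

In this toy model, if T1 is 100000 ns away and T2 is 1000000 ns away,
reporting T2 selects "C6", whose exit latency alone exceeds the time left
until T1, whereas reporting T1 selects "C3".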

Fix this by assuming that expiries aren't sorted top-down while
performing the quick check. Instead, pick up the earliest one encountered
while walking up the hierarchy.
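
The idea can be sketched in isolation as follows. struct sketch_group, its
lonely flag and the sketch_* helpers are hypothetical simplifications for
illustration only. The walk keeps the lowest expiry seen at every level
instead of trusting the top-level value:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for KTIME_MAX (assumption: modeled as the largest s64 value). */
#define SKETCH_KTIME_MAX ((uint64_t)INT64_MAX)

/* Hypothetical, simplified group: not the kernel's struct tmigr_group. */
struct sketch_group {
	const struct sketch_group *parent;
	uint64_t next_expiry;	/* earliest event queued in this group */
	int lonely;		/* 1 if only a single child is active */
};

/* Track the lowest expiry observed on the way up to the top level. */
static uint64_t sketch_quick_check(const struct sketch_group *group,
				   uint64_t nextevt)
{
	assert(group != NULL);

	do {
		if (!group->lonely)
			return SKETCH_KTIME_MAX;
		if (group->next_expiry < nextevt)
			nextevt = group->next_expiry;
		if (!group->parent)
			return nextevt;
		group = group->parent;
	} while (group);

	return SKETCH_KTIME_MAX;
}

/* Two-level hierarchy: GRP0:0 holds t1, the top level only holds t2. */
static uint64_t sketch_scenario(uint64_t t1, uint64_t t2)
{
	struct sketch_group top = { NULL, t2, 1 };
	struct sketch_group grp0_0 = { &top, t1, 1 };

	return sketch_quick_check(&grp0_0, SKETCH_KTIME_MAX);
}
```

With T1 < T2 this sketch returns T1, and it still returns the top-level
value when that one happens to be the earliest.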

Fixes: 7ee98877 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240305002822.18130-1-frederic@kernel.org
parent a184d983

@@ -1385,11 +1385,11 @@ u64 tmigr_cpu_deactivate(u64 nextexp)
  *			  single group active on the way to top level)
  * * nextevt		- when CPU is offline and has to handle timer on his own
  *			  or when on the way to top in every group only a single
- *			  child is active and but @nextevt is before next_expiry
- *			  of top level group
- * * next_expiry (top)	- value of top level group, when on the way to top in
- *			  every group only a single child is active and @nextevt
- *			  is after this value active child.
+ *			  child is active but @nextevt is before the lowest
+ *			  next_expiry encountered while walking up to top level.
+ * * next_expiry	- value of lowest expiry encountered while walking groups
+ *			  if only a single child is active on each and @nextevt
+ *			  is after this lowest expiry.
  */
 u64 tmigr_quick_check(u64 nextevt)
 {
@@ -1408,10 +1408,16 @@ u64 tmigr_quick_check(u64 nextevt)
 	do {
 		if (!tmigr_check_lonely(group)) {
 			return KTIME_MAX;
-		} else if (!group->parent) {
-			u64 first_global = READ_ONCE(group->next_expiry);
-
-			return min_t(u64, nextevt, first_global);
+		} else {
+			/*
+			 * Since current CPU is active, events may not be sorted
+			 * from bottom to the top because the CPU's event is ignored
+			 * up to the top and its sibling's events not propagated upwards.
+			 * Thus keep track of the lowest observed expiry.
+			 */
+			nextevt = min_t(u64, nextevt, READ_ONCE(group->next_expiry));
+			if (!group->parent)
+				return nextevt;
 		}
 		group = group->parent;
 	} while (group);