1. 19 Mar, 2024 2 commits
    • timers: Fix removed self-IPI on global timer's enqueue in nohz_full · 03877039
      Frederic Weisbecker authored
      While running in nohz_full mode, a task may enqueue a timer while the
      tick is stopped. However the only places where the timer wheel,
      along with the timer migration machinery, may reprogram the next
      event in accordance with that new timer's expiry are the idle loop
      or any IRQ tail.
      
      But in full dynticks mode, neither the idle task nor an interrupt may
      run on the CPU if it resumes busy work in userspace for a long while.
      
      To solve this, the timer enqueue path raises a self-IPI that
      re-evaluates the timer wheel from its IRQ tail. This asynchronous
      solution avoids a potential lock inversion.
      
      This is supposed to happen both for local and global timers but commit:
      
      	b2cf7507 ("timers: Always queue timers on the local CPU")
      
      broke the global timers case by removing the ->is_idle field handling
      for the global base. As a result, global timer enqueues may go
      unnoticed in nohz_full.
      
      Fix this by restoring the idle tracking of the global timer base,
      allowing self-IPIs again at enqueue time.
      
      Fixes: b2cf7507 ("timers: Always queue timers on the local CPU")
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20240318230729.15497-3-frederic@kernel.org
    • timers/migration: Fix endless timer requeue after idle interrupts · f55acb1e
      Frederic Weisbecker authored
      When a CPU is an idle migrator but another CPU wakes up before it,
      becomes the active migrator and handles the queue, the initial idle
      migrator may end up endlessly reprogramming its clockevent, chasing
      ghost timers forever, as in the following scenario:
      
                     [GRP0:0]
                   migrator = 0
                   active   = 0
                   nextevt  = T1
                    /         \
                   0           1
                active        idle (T1)
      
      0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
      the active migrator.
      
                     [GRP0:0]
                   migrator = NONE
                   active   = NONE
                   nextevt  = T1
                    /         \
                   0           1
                idle        idle (T1)
                wakeup = T1
      
      1) CPU 0 is now idle and is therefore the idle migrator. It has
      programmed its next timer interrupt to handle T1.
      
                      [GRP0:0]
                   migrator = 1
                   active   = 1
                   nextevt  = KTIME_MAX
                    /         \
                   0           1
                idle        active
                wakeup = T1
      
      2) CPU 1 has woken up, it is now active and it has just handled its own
      timer T1.
      
      3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
      realizes it is not the migrator anymore. So it returns early without
      observing that T1 has already expired and therefore without
      updating its ->wakeup value.
      
      4) CPU 0 goes into tmigr_cpu_new_timer() which also returns early
      because it doesn't have a timer of its own to queue. So ->wakeup is
      left unchanged and the next timer interrupt is programmed to fire
      immediately.
      
      5) goto 3) forever
      
      This results in timer interrupt storms in idle and also in nohz_full (as
      observed in rcutorture's TREE07 scenario).
      
      Fix this by forcing a re-evaluation of tmc->wakeup while attempting
      remote timer handling when the CPU isn't the migrator anymore. The
      check is inherently racy, but in the worst case the CPU just races
      with a remote expiry in setting the same KTIME_MAX value.
      
      Fixes: 7ee98877 ("timers: Implement the hierarchical pull model")
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20240318230729.15497-2-frederic@kernel.org
  2. 16 Mar, 2024 1 commit
    • timer/migration: Remove buggy early return on deactivation · 4b6f4c5a
      Frederic Weisbecker authored
      When a CPU goes idle and deactivates itself from the timer migration
      hierarchy without any global timer of its own to propagate, the group
      event of that CPU is set to "ignore" and tmigr_update_events()
      accordingly returns early without considering timers queued by other
      CPUs.
      
      If the hierarchy has a single level and the CPU is the last one to
      enter idle, it will ignore the other CPUs' global timers, as in the
      following layout:
      
                 [GRP0:0]
               migrator = 0
               active   = 0
               nextevt  = T0i
                /         \
               0           1
            active (T0i)  idle (T1)
      
      0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
      upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
      
                 [GRP0:0]
               migrator = NONE
               active   = NONE
               nextevt  = T0i
                /         \
               0           1
            idle (T0i)  idle (T1)
      
      1) CPU 0 goes idle without a global event queued. Therefore KTIME_MAX
      is pushed as its next expiry and its own event is kept as "ignore". As
      a result tmigr_update_events() ignores T1 and CPU 0 goes idle with T1
      unhandled.
      
      This problem is not specific to a single-level hierarchy though. A
      similar, slightly different issue may arise with multiple levels:
      
                                  [GRP1:0]
                               migrator = GRP0:0
                               active   = GRP0:0
                               nextevt  = T0:0i, T0:1
                               /              \
                    [GRP0:0]                  [GRP0:1]
                 migrator = 0              migrator = NONE
                 active   = 0              active   = NONE
                 nextevt  = T0i            nextevt  = T2
                 /         \                /         \
                0 (T0i)     1 (T1)         2 (T2)      3
            active         idle            idle       idle
      
      0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
      upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
      CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
      
                                  [GRP1:0]
                               migrator = GRP0:0
                               active   = GRP0:0
                               nextevt  = T0:0i, T0:1
                               /              \
                    [GRP0:0]                  [GRP0:1]
                 migrator = NONE           migrator = NONE
                 active   = NONE           active   = NONE
                 nextevt  = T0i            nextevt  = T2
                 /         \                /         \
                0 (T0i)     1 (T1)         2 (T2)      3
              idle         idle            idle         idle
      
      1) CPU 0 goes idle without a global event queued. Therefore KTIME_MAX
      is pushed as its next expiry and its own event is kept as "ignore". As
      a result tmigr_update_events() ignores T1. So far the change has only
      propagated up to the first level.
      
                                  [GRP1:0]
                               migrator = NONE
                               active   = NONE
                               nextevt  = T0:1
                               /              \
                    [GRP0:0]                  [GRP0:1]
                 migrator = NONE           migrator = NONE
                 active   = NONE           active   = NONE
                 nextevt  = T0i            nextevt  = T2
                 /         \                /         \
                0 (T0i)     1 (T1)         2 (T2)      3
              idle         idle            idle         idle
      
      2) The change now propagates up to the top. tmigr_update_events()
      finds that the child event is ignored and thus removes it. The top
      level's next event is now T2, which is returned to CPU 0 as the next
      effective expiry it must account for as the global idle migrator.
      However T1 has been ignored along the way, leaving it unhandled.
      
      Fix those issues by removing the buggy early return. Ignored child
      events must not prevent the evaluation of the other events within the
      same group.
      Reported-by: Boqun Feng <boqun.feng@gmail.com>
      Reported-by: Florian Fainelli <f.fainelli@gmail.com>
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen
  3. 06 Mar, 2024 1 commit
    • timer/migration: Fix quick check reporting late expiry · 8ca18367
      Frederic Weisbecker authored
      When the last active CPU in the hierarchy tries to enter idle, the
      quick check that looks up the next event for the cpuidle heuristics
      may report a too-late expiry, as in the following scenario:
      
                              [GRP1:0]
                           migrator = NONE
                           active   = NONE
                           nextevt  = T0:0, T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = NONE           migrator = NONE
             active   = NONE           active   = NONE
             nextevt  = T0, T1         nextevt  = T2
             /         \                /         \
            0           1              2           3
          idle       idle           idle         idle
      
      0) The whole system is idle, and CPU 0 was the last migrator. CPU 0 has
      a timer (T0), CPU 1 has a timer (T1) and CPU 2 has a timer (T2). The
      expire order is T0 < T1 < T2.
      
                              [GRP1:0]
                           migrator = GRP0:0
                           active   = GRP0:0
                           nextevt  = T0:0(i), T0:1
                         /              \
                [GRP0:0]                  [GRP0:1]
             migrator = CPU0           migrator = NONE
             active   = CPU0           active   = NONE
             nextevt  = T0(i), T1      nextevt  = T2
             /         \                /         \
            0           1              2           3
          active       idle           idle         idle
      
      1) CPU 0 becomes active. The (i) means a now ignored timer.
      
                              [GRP1:0]
                           migrator = GRP0:0
                           active   = GRP0:0
                           nextevt  = T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = CPU0           migrator = NONE
             active   = CPU0           active   = NONE
             nextevt  = T1             nextevt  = T2
             /         \                /         \
            0           1              2           3
          active       idle           idle         idle
      
      2) CPU 0 handles remote timers. No timer actually expired, but ignored
         timers have been cleaned out while their siblings' timers haven't
         been propagated up. As a result the top level's next event is T2
         and not T1.
      
      3) CPU 0 tries to enter idle without any global timer enqueued and calls
         tmigr_quick_check(). The expiry of T2 is returned instead of the
         expiry of T1.
      
      When the quick check returns an expiry that is too late, the cpuidle
      governor may pick a C-state that is too deep. This may result in
      undesired CPU wakeup latency if the next timer is actually close enough.
      
      Fix this by assuming that expiries aren't sorted top-down while
      performing the quick check. Instead, pick the earliest expiry
      encountered while walking up the hierarchy.
      
      Fixes: 7ee98877 ("timers: Implement the hierarchical pull model")
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20240305002822.18130-1-frederic@kernel.org
  4. 29 Feb, 2024 1 commit
    • tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n · a184d983
      Arnd Bergmann authored
      In configurations with CONFIG_TICK_ONESHOT but no CONFIG_NO_HZ or
      CONFIG_HIGH_RES_TIMERS, tick_sched_timer_dying() is stubbed out,
      but still defined as a global function as well:
      
      kernel/time/tick-sched.c:1599:6: error: redefinition of 'tick_sched_timer_dying'
       1599 | void tick_sched_timer_dying(int cpu)
            |      ^
      kernel/time/tick-sched.h:111:20: note: previous definition is here
        111 | static inline void tick_sched_timer_dying(int cpu) { }
            |                    ^
      
      This configuration only appears with ARM CONFIG_ARCH_BCM_MOBILE,
      which should not actually select CONFIG_TICK_ONESHOT.
      
      Adjust the #ifdef for the stub to match the condition for building
      tick-sched.c, for consistency with the real definition and to avoid
      the build regression.
      
      Fixes: 3aedb7fc ("tick/sched: Remove useless oneshot ifdeffery")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20240228123850.3499024-1-arnd@kernel.org
  5. 26 Feb, 2024 17 commits
  6. 22 Feb, 2024 18 commits