    Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d08c407f
    Linus Torvalds authored
    Pull timer updates from Thomas Gleixner:
     "A large set of updates and features for timers and timekeeping:
    
       - The hierarchical timer pull model
    
         When timer wheel timers are armed they are placed into the timer
         wheel of a CPU which is likely to be busy at the time of expiry.
         This is done to avoid wakeups on potentially idle CPUs.
    
         This is wrong in several aspects:
    
           1) The heuristics to select the target CPU are wrong by
              definition as the chance to get the prediction right is
              close to zero.
    
           2) Due to #1 it is possible that timers are accumulated on
              a single target CPU.
    
           3) The required computation in the enqueue path is just overhead
              for dubious value especially under the consideration that the
              vast majority of timer wheel timers are either canceled or
              rearmed before they expire.
    
         The timer pull model avoids the above by removing the target
         computation on enqueue and queueing timers always on the CPU on
         which they get armed.
    
         This is achieved by having separate wheels for CPU pinned timers
         and global timers which do not care about where they expire.
    
         As long as a CPU is busy it handles both the pinned and the global
         timers which are queued on the CPU local timer wheels.
    
         When a CPU goes idle it evaluates its own timer wheels:
    
           - If the first expiring timer is a pinned timer, then the global
             timers can be ignored as the CPU will wake up before they
             expire.
    
           - If the first expiring timer is a global timer, then the expiry
             time is propagated into the timer pull hierarchy and the CPU
             makes sure to wake up for the first pinned timer.
    
         The timer pull hierarchy organizes CPUs in groups of eight at the
         lowest level and at the next levels groups of eight groups up to
         the point where no further aggregation of groups is required, i.e.
         the number of levels is log8(NR_CPUS). The magic number of eight
         has been established by experimentation, but can be adjusted if
         needed.
    
         In each group one busy CPU acts as the migrator. It's only one CPU
         to avoid lock contention on remote timer wheels.
    
         The migrator CPU checks in its own timer wheel handling whether
         there are other CPUs in the group which have gone idle and have
         global timers to expire. If there are global timers to expire, the
         migrator locks the remote CPU timer wheel and handles the expiry.
    
         Depending on the group level in the hierarchy, this handling can
         require walking the hierarchy downwards to the CPU level.
    
         Special care is taken when the last CPU goes idle. At this point
         the CPU is the systemwide migrator at the top of the hierarchy and
         it therefore cannot delegate to the hierarchy. It needs to arm its
         own timer device to expire either at the first expiring timer in
         the hierarchy or at the first CPU local timer, whichever expires
         first.
    
         This completely removes the overhead from the enqueue path, which
         is a true hotpath for e.g. networking, and trades it for a slightly
         more complex idle path.
    
         This has been in development for a couple of years and the final
         series has been extensively tested by various teams from silicon
         vendors and run through extensive CI.
    
         There have been slight performance improvements observed on network
         centric workloads and an Intel team confirmed that this allows them
         to power down a die completely on a multi-die socket for the first
         time in a mostly idle scenario.
    
         There is only one outstanding ~1.5% regression on a specific
         overloaded netperf test which is currently being investigated, but
         the rest is either positive or neutral performance-wise and positive
         on the power management side.
    
       - Fixes for the timekeeping interpolation code for cross-timestamps:
    
         Cross-timestamps are used for PTP to get snapshots from hardware
         timers and interpolate them back to clock MONOTONIC. The changes
         address a few corner cases in the interpolation code which got the
         math and logic wrong.
    
       - Simplification of the clocksource watchdog retry logic to
         automatically adjust to handle larger systems correctly instead of
         having more incomprehensible command line parameters.
    
       - Treewide consolidation of the VDSO data structures.
    
       - The usual small improvements and cleanups all over the place"
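
    Illustration only, not part of the merged series: the two rules from
    the pull model description above (groups of eight per level, and the
    decision a CPU makes when it goes idle) can be sketched as a small
    Python model. The function names, the use of None for "no timer
    queued", and the handling of a tie between a pinned and a global
    timer are assumptions made for this sketch. It models only the text
    above and is not the kernel's implementation.

```python
from typing import Optional, Tuple

GROUP_SIZE = 8  # fan-out named in the text; described there as adjustable


def hierarchy_levels(nr_cpus: int, group_size: int = GROUP_SIZE) -> int:
    """Count grouping levels until a single top-level group remains.

    Integer form of ceil(log_group_size(nr_cpus)), avoiding float rounding.
    A single CPU yields 0 here; that edge case is a property of this model
    only and says nothing about what the kernel does.
    """
    if nr_cpus < 1 or group_size < 2:
        raise ValueError("need nr_cpus >= 1 and group_size >= 2")
    levels = 0
    units = nr_cpus
    while units > 1:
        # Each level packs up to group_size units into one group.
        units = (units + group_size - 1) // group_size
        levels += 1
    return levels


def evaluate_idle(
    first_pinned: Optional[int], first_global: Optional[int]
) -> Tuple[Optional[int], Optional[int]]:
    """Return (local_wakeup, expiry_propagated_to_hierarchy).

    Arguments are the earliest expiry times of the CPU's pinned and global
    wheels, or None when that wheel is empty (hypothetical representation).
    """
    if first_global is None:
        # Nothing global to hand over; only the pinned timer matters.
        return first_pinned, None
    if first_pinned is not None and first_pinned <= first_global:
        # Pinned timer expires first (ties treated the same, by assumption):
        # the CPU wakes before the global timer expires, so it is ignored.
        return first_pinned, None
    # Global timer expires first: propagate its expiry into the hierarchy
    # and still wake up locally for the first pinned timer, if any.
    return first_pinned, first_global
```

    Under these assumptions hierarchy_levels(64) gives 2 and
    hierarchy_levels(65) gives 3, since 65 CPUs need nine groups at the
    lowest level and those nine no longer fit into a single group of
    eight.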
    
    * tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
      timer/migration: Fix quick check reporting late expiry
      tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
      vdso/datapage: Quick fix - use asm/page-def.h for ARM64
      timers: Assert no next dyntick timer look-up while CPU is offline
      tick: Assume timekeeping is correctly handed over upon last offline idle call
      tick: Shut down low-res tick from dying CPU
      tick: Split nohz and highres features from nohz_mode
      tick: Move individual bit features to debuggable mask accesses
      tick: Move got_idle_tick away from common flags
      tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
      tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
      tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
      tick: Start centralizing tick related CPU hotplug operations
      tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
      tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
      tick: Use IS_ENABLED() whenever possible
      tick/sched: Remove useless oneshot ifdeffery
      tick/nohz: Remove duplicate between lowres and highres handlers
      tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
      hrtimer: Select housekeeping CPU during migration
      ...