    watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs · 1f423c90
    Douglas Anderson authored
    Implement a hardlockup detector that doesn't need any extra
    arch-specific support code to detect lockups.  Instead of using something
    arch-specific we will use the buddy system, where each CPU watches out for
    another one.  Specifically, each CPU will use its softlockup hrtimer to
    check that the next CPU is processing hrtimer interrupts by verifying that
    a counter is increasing.
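
    The check described above can be sketched as a small userspace
    simulation.  This is only an illustration under stated assumptions, not
    the code from this patch: the fixed NR_CPUS value, the array names and
    the helpers buddy_of(), buddy_is_stuck() and run_round() are all
    hypothetical, and each "round" stands in for one period of the
    softlockup hrtimer.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for this sketch */

/* Per-CPU count of timer interrupts handled (illustrative name). */
static unsigned long hrtimer_interrupts[NR_CPUS];
/* Counter value seen the last time each CPU was checked by its watcher. */
static unsigned long hrtimer_interrupts_saved[NR_CPUS];

/* Each CPU watches the next one, wrapping around at the end. */
static int buddy_of(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpu + 1) % NR_CPUS;
}

/* A watched CPU counts as stuck if its counter did not move since the last check. */
static bool buddy_is_stuck(int buddy)
{
	unsigned long now = hrtimer_interrupts[buddy];

	if (now == hrtimer_interrupts_saved[buddy])
		return true;

	hrtimer_interrupts_saved[buddy] = now;
	return false;
}

/*
 * Simulate one timer period.  Every CPU that is still alive first bumps
 * its own counter, then checks its buddy.  Returns a bitmask of the CPUs
 * that were reported as stuck.
 */
static unsigned int run_round(const bool alive[NR_CPUS])
{
	unsigned int stuck_mask = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (alive[cpu])
			hrtimer_interrupts[cpu]++;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (alive[cpu] && buddy_is_stuck(buddy_of(cpu)))
			stuck_mask |= 1u << buddy_of(cpu);

	return stuck_mask;
}
```

    In this sketch a round with every CPU alive reports nothing, and a
    round in which CPU 2 stops taking timer interrupts is reported by CPU 1.
    A round in which no CPU is alive also reports nothing, because nobody is
    left to run the check.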
    
    NOTE: unlike the other hard lockup detectors, the buddy one can't easily
    show what's happening on the CPU that locked up just by doing a simple
    backtrace.  It relies on some other mechanism in the system to get
    information about the locked up CPUs.  This could be support for NMI
    backtraces like [1], it could be a mechanism for printing the PC of locked
    CPUs at panic time like [2] / [3], or it could be something else.  Even
    though that means we still rely on arch-specific code, this arch-specific
    code seems to often be implemented even on architectures that don't have a
    hardlockup detector.
    
    This style of hardlockup detector originated in some downstream Android
    trees and has been rebased on / carried in ChromeOS trees for quite a long
    time for use on arm and arm64 boards.  Historically on these boards we've
    leveraged mechanism [2] / [3] to get information about hung CPUs, but we
    could move to [1].
    
    Although the original motivation for the buddy system was for use on
    systems without an arch-specific hardlockup detector, it can still be
    useful even on systems that _do_ have an arch-specific hardlockup
    detector.  On x86, for instance, there is a 24-part patch series [4] in
    progress switching the arch-specific hard lockup detector from a scarce
    perf counter to a less-scarce hardware resource.  Potentially the buddy
    system could be a simpler alternative to free up the perf counter but
    still get hard lockup detection.
    
    Overall, pros (+) and cons (-) of the buddy system compared to an
    arch-specific hardlockup detector (which might be implemented using
    perf):
    + The buddy system is usable on systems that don't have an
      arch-specific hardlockup detector, like arm32 and arm64 (though it's
      being worked on for arm64 [5]).
    + The buddy system may free up scarce hardware resources.
    + If a CPU totally goes out to lunch (can't process NMIs) the buddy
      system could still detect the problem (though it would be unlikely
      to be able to get a stack trace).
    + The buddy system uses the same timer function to pet the hardlockup
      detector on the running CPU as it uses to detect hardlockups on
      other CPUs. Compared to other hardlockup detectors, this means it
      generates fewer interrupts and thus is likely better able to let
      CPUs stay idle longer.
    - If all CPUs are hard locked up at the same time the buddy system
      can't detect it.
    - If we don't have SMP we can't use the buddy system.
    - The buddy system needs an arch-specific mechanism (possibly NMI
      backtrace) to get info about the locked up CPU.
    
    [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org
    [2] https://issuetracker.google.com/172213129
    [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html
    [4] https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com/
    [5] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/
    
    Link: https://lkml.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid
    Signed-off-by: Colin Cross <ccross@android.com>
    Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
    Signed-off-by: Guenter Roeck <groeck@chromium.org>
    Signed-off-by: Tzung-Bi Shih <tzungbi@chromium.org>
    Signed-off-by: Douglas Anderson <dianders@chromium.org>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chen-Yu Tsai <wens@csie.org>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Daniel Thompson <daniel.thompson@linaro.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Marc Zyngier <maz@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Pingfan Liu <kernelfans@gmail.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
    Cc: Ricardo Neri <ricardo.neri@intel.com>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Stephen Boyd <swboyd@chromium.org>
    Cc: Sumit Garg <sumit.garg@linaro.org>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>