    thermal: core: Fix thermal zone suspend-resume synchronization · 4e814173
    Rafael J. Wysocki authored
    There are 3 synchronization issues with thermal zone suspend-resume
    during system-wide transitions:
    
     1. The resume code runs in a PM notifier which is invoked after user
        space has been thawed, so it can run concurrently with user space
        which can trigger a thermal zone device removal.  If that happens,
        the thermal zone resume code may use a stale pointer to the next
        list element and crash, because it does not hold thermal_list_lock
        while walking thermal_tz_list.
    
     2. The thermal zone resume code calls thermal_zone_device_init()
        outside the zone lock, so user space or an update triggered by
        the platform firmware may see an inconsistent state of a
        thermal zone leading to unexpected behavior.
    
     3. Clearing the in_suspend global variable in thermal_pm_notify()
        allows __thermal_zone_device_update() to continue for all thermal
        zones and it may as well run before the thermal_tz_list walk (or
        at any point during the list walk for that matter) and attempt to
        operate on a thermal zone that has not been resumed yet.  It may
        also race destructively with thermal_zone_device_init().
    
    To address these issues, add thermal_list_lock locking to
    thermal_pm_notify(), especially around the thermal_tz_list walk,
    make it call thermal_zone_device_init() back-to-back with
    __thermal_zone_device_update() under the zone lock and replace
    in_suspend with per-zone bool "suspend" indicators set and unset
    under the given zone's lock.
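
The locking scheme described above can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the actual patch: `struct zone`, `zone_register()`, `zone_update()`, `zones_suspend()` and `zones_resume()` are hypothetical names, and pthread mutexes stand in for thermal_list_lock and the per-zone lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a thermal zone; not the kernel's structure. */
struct zone {
	pthread_mutex_t lock;	/* stands in for the per-zone lock */
	bool suspended;		/* stands in for the per-zone "suspend" flag */
	int inits;		/* how many times the modeled init step ran */
	int updates;		/* how many updates were allowed to run */
	struct zone *next;
};

/* Stand-ins for thermal_list_lock and thermal_tz_list. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct zone *zone_list;

void zone_register(struct zone *z)
{
	pthread_mutex_init(&z->lock, NULL);
	z->suspended = false;
	z->inits = 0;
	z->updates = 0;

	pthread_mutex_lock(&list_lock);
	z->next = zone_list;
	zone_list = z;
	pthread_mutex_unlock(&list_lock);
}

/* Models the update step; the caller must hold z->lock. */
static bool zone_update_locked(struct zone *z)
{
	if (z->suspended)
		return false;	/* a zone that is not resumed yet is skipped */
	z->updates++;
	return true;
}

/* Locked wrapper, as used by callers outside the PM path. */
bool zone_update(struct zone *z)
{
	bool ran;

	pthread_mutex_lock(&z->lock);
	ran = zone_update_locked(z);
	pthread_mutex_unlock(&z->lock);
	return ran;
}

void zones_suspend(void)
{
	struct zone *z;

	pthread_mutex_lock(&list_lock);	/* the list cannot change during the walk */
	for (z = zone_list; z; z = z->next) {
		pthread_mutex_lock(&z->lock);
		z->suspended = true;
		pthread_mutex_unlock(&z->lock);
	}
	pthread_mutex_unlock(&list_lock);
}

void zones_resume(void)
{
	struct zone *z;

	pthread_mutex_lock(&list_lock);
	for (z = zone_list; z; z = z->next) {
		pthread_mutex_lock(&z->lock);
		z->suspended = false;
		z->inits++;		/* models the init step */
		zone_update_locked(z);	/* back-to-back, same critical section */
		pthread_mutex_unlock(&z->lock);
	}
	pthread_mutex_unlock(&list_lock);
}
```

In this model, a `zone_update()` call made between `zones_suspend()` and `zones_resume()` returns false for every zone, and no caller can observe a zone between its init step and its first update, because both happen in a single critical section under that zone's lock.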
    
    Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/
    Reported-by: Bo Ye <bo.ye@mediatek.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
thermal_core.c 41.6 KB