    workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup · d6e022f1
    Tejun Heo authored
    When looking up the pool_workqueue to use for an unbound workqueue,
    workqueue assumes that the target CPU is always bound to a valid NUMA
    node.  However, currently, when a CPU goes offline, the mapping is
    destroyed and cpu_to_node() returns NUMA_NO_NODE.
    
    This has always been broken but hasn't triggered often enough before
    874bbfe6 ("workqueue: make sure delayed work run in local cpu").
    After the commit, workqueue forcibly assigns the local CPU for
    delayed work items without explicit target CPU to fix a different
    issue.  This widens the window where a CPU can go offline while a
    delayed work item is pending, causing delayed work items to be
    dispatched with the target CPU set to an already offlined CPU.  The
    resulting NUMA_NO_NODE mapping makes workqueue try to queue the work
    item on a NULL pool_workqueue and thus crash.
    
    While 874bbfe6 has been reverted for a different reason, making the
    bug less visible again, it can still happen.  Fix it by mapping
    NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
    This is a temporary workaround.  The long-term solution is keeping
    the CPU -> NODE mapping stable across CPU off/online cycles, which is
    being worked on.
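
    The workaround described above can be sketched in userspace C as
    follows.  This is an illustration, not the actual patch: the structs
    are hypothetical, stripped-down stand-ins for the kernel ones,
    MAX_NODES is an assumed node count, and the RCU/locking handling of
    the real lookup is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)   /* same value the kernel uses */
#define MAX_NODES    2      /* assumed node count for this sketch */

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct pool_workqueue {
	int id;
};

struct workqueue_struct {
	struct pool_workqueue *dfl_pwq;                 /* default pwq */
	struct pool_workqueue *numa_pwq_tbl[MAX_NODES]; /* per-node pwqs */
};

/*
 * Look up the pwq for @node.  An offlined CPU reports NUMA_NO_NODE,
 * which must not be used as a table index, so fall back to the
 * default pwq in that case instead of returning a bogus pointer.
 */
static struct pool_workqueue *
unbound_pwq_by_node(struct workqueue_struct *wq, int node)
{
	if (node == NUMA_NO_NODE)
		return wq->dfl_pwq;

	assert(node >= 0 && node < MAX_NODES);
	return wq->numa_pwq_tbl[node];
}
```

    With this check in place, a delayed work item whose target CPU went
    offline is queued on the default pool_workqueue rather than on a
    NULL one.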
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
    Cc: Tang Chen <tangchen@cn.fujitsu.com>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Cc: Len Brown <len.brown@intel.com>
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
    Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
workqueue.c 152 KB