    workqueue: Implement system-wide nr_active enforcement for unbound workqueues · 5797b1c1
    Tejun Heo authored
    A pool_workqueue (pwq) represents the connection between a workqueue and a
    worker_pool. One of the roles that a pwq plays is enforcement of the
    max_active concurrency limit. Before 636b927e ("workqueue: Make unbound
    workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
    per-cpu workqueues and one per NUMA node for unbound workqueues, which
    was a natural result of per-cpu workqueues being served by per-cpu pools and
    unbound by per-NUMA pools.
    
    In terms of max_active enforcement, this was, while not perfect, workable.
    For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great
    in that NUMA machines would get an effective max_active multiplied by the
    number of nodes, but this didn't cause huge problems because NUMA machines
    are relatively rare and the node count is usually pretty low.
    
    However, cache layouts are more complex now and sharing a worker pool across
    a whole node didn't really work well for unbound workqueues. Thus, a series
    of commits culminating in 8639eceb ("workqueue: Implement non-strict
    affinity scope for unbound workqueues") implemented a more flexible affinity
    mechanism for unbound workqueues which enables using e.g. last-level-cache
    aligned pools. In the process, 636b927e ("workqueue: Make unbound
    workqueues to use per-cpu pool_workqueues") made unbound workqueues use
    per-cpu pwqs like per-cpu workqueues.
    
    While the change was necessary to enable more flexible affinity scopes, this
    came with the side effect of blowing up the effective max_active for unbound
    workqueues. Before, the effective max_active for unbound workqueues was
    multiplied by the number of nodes. After, by the number of CPUs.
    
    636b927e ("workqueue: Make unbound workqueues to use per-cpu
    pool_workqueues") claims that this should generally be okay. It is okay for
    users that self-regulate their concurrency level, which are the vast
    majority; however, there are enough use cases which actually depend on
    max_active to prevent the level of concurrency from going bonkers, including
    several IO handling workqueues that can issue a work item for each
    in-flight IO. With
    targeted benchmarks, the misbehavior can easily be exposed as reported in
    http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
    
    Unfortunately, there is no way to express what these use cases need using
    per-cpu max_active. A single CPU may issue most of the in-flight IOs, so we
    don't want to set max_active too low, but as soon as we increase max_active
    a bit, we can end up with an unreasonable number of in-flight work items
    when many CPUs issue IOs at the same time. i.e. the lowest acceptable
    max_active is higher than the highest acceptable max_active.
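
    To put rough numbers on the dilemma above, here is a minimal sketch of the
    worst case under per-cpu enforcement. The helper name and the figures in
    the note below are hypothetical and only illustrate the arithmetic; this is
    not code from workqueue.c.

```c
/*
 * Worst-case number of concurrently active work items when max_active
 * is enforced separately on every CPU (hypothetical helper).
 */
static unsigned int worst_case_active(unsigned int per_cpu_max_active,
				      unsigned int nr_cpus)
{
	/* Every CPU can independently fill its own per-cpu limit. */
	return per_cpu_max_active * nr_cpus;
}
```

    For example, if a single CPU needs 16 active work items to keep its IOs
    flowing, a 64-CPU machine may end up with worst_case_active(16, 64) = 1024
    active work items once every CPU issues IOs at the same time.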
    
    Ideally, max_active for an unbound workqueue should be system-wide so that
    the users can regulate the total level of concurrency regardless of node and
    cache layout. The reasons workqueue hasn't implemented that yet are:
    
    - Once max_active enforcement is decoupled from pool boundaries, chaining
      execution after a work item finishes requires inter-pool operations which
      would require lock dancing, which is nasty.
    
    - Sharing a single nr_active count across the whole system can be pretty
      expensive on NUMA machines.
    
    - Per-pwq enforcement had been more or less okay while we were using
      per-node pools.
    
    It looks like we can no longer avoid decoupling max_active enforcement from
    pool boundaries. This patch implements a system-wide nr_active mechanism
    with the following design characteristics:
    
    - To avoid sharing a single counter across multiple nodes, the configured
      max_active is split across nodes according to the proportion of each
      workqueue's online effective CPUs per node. e.g. A node with twice as many
      online effective CPUs will get twice as large a portion of max_active.
    
    - Workqueue used to be able to process a chain of interdependent work items
      which is as long as max_active. We can't do this anymore as max_active is
      distributed across the nodes. Instead, a new parameter min_active is
      introduced which determines the minimum level of concurrency within a node
      regardless of how max_active distribution comes out to be.
    
      It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
      This can lead to higher effective max_active than configured and also
      deadlocks if a workqueue was depending on being able to handle chains of
      interdependent work items that are longer than 8.
    
      I believe these should be fine given that the number of CPUs in each NUMA
      node is usually higher than 8 and a work item chain longer than 8 is pretty
      unlikely. However, if these assumptions turn out to be wrong, we'll need
      to add an interface to adjust min_active.
    
    - Each unbound wq has an array of struct wq_node_nr_active which tracks
      per-node nr_active. When its pwq wants to run a work item, it has to
      obtain the matching node's nr_active. If over the node's max_active, the
      pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
      the completion path round-robins the pending pwqs activating the first
      inactive work item of each, which involves some pool lock dancing and
      kicking other pools. It's not the simplest code but doesn't look too bad.
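
    The per-node split and the per-node counter described above can be sketched
    in userspace C as follows. This is a simplified illustration, not the code
    in workqueue.c: struct node_nr_active and all helper names are hypothetical
    stand-ins, rounding the proportional share up is an assumption, and the
    pending_pwqs list and pool locking are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define WQ_DFL_MIN_ACTIVE 8	/* value stated in the description above */

/* Hypothetical stand-in for the per-node part of wq_node_nr_active. */
struct node_nr_active {
	int max;	/* cached per-node limit */
	atomic_int nr;	/* work items currently active on this node */
};

static int div_round_up(int n, int d)
{
	return (n + d - 1) / d;
}

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Split max_active in proportion to the workqueue's online effective
 * CPUs on this node, but never hand out less than min_active.
 */
static int node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active = max_active < WQ_DFL_MIN_ACTIVE ?
			 max_active : WQ_DFL_MIN_ACTIVE;

	if (total_cpus <= 0)	/* nothing to split against */
		return max_active;

	return clamp_int(div_round_up(max_active * node_cpus, total_cpus),
			 min_active, max_active);
}

/* Try to take one nr_active slot on the node; false means "over the limit". */
static bool node_tryinc_nr_active(struct node_nr_active *nna)
{
	int old = atomic_load(&nna->nr);

	while (old < nna->max) {
		/* On failure, old is reloaded and the limit is rechecked. */
		if (atomic_compare_exchange_weak(&nna->nr, &old, old + 1))
			return true;
	}
	return false;
}

/* Release a slot when a work item finishes. */
static void node_dec_nr_active(struct node_nr_active *nna)
{
	atomic_fetch_sub(&nna->nr, 1);
}
```

    In this sketch, max_active = 256 on a machine with a 48-CPU node and a
    16-CPU node yields per-node limits of 192 and 64. With max_active = 16
    spread over eight 2-CPU nodes, the min_active floor gives every node 8,
    which is the "higher effective max_active than configured" case mentioned
    above. A pwq that fails node_tryinc_nr_active() would be the one that gets
    queued on pending_pwqs.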
    
    v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
    
        - wq_adjust_max_active() is now protected by wq->mutex instead of
          wq_pool_mutex.
    
    v3: - wq_node_max_active() used to calculate per-node max_active on the fly
          based on system-wide CPU online states. Lai pointed out that this can
          lead to skewed distributions for workqueues with restricted cpumasks.
          Update the max_active distribution to use per-workqueue effective
          online CPU counts instead of system-wide and cache the calculation
          results in node_nr_active->max.
    
    v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
    Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
    Fixes: 636b927e ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>