    workqueue: Implement non-strict affinity scope for unbound workqueues · 8639eceb
    Tejun Heo authored
    An unbound workqueue can be served by multiple worker_pools to improve
    locality. The segmentation is achieved by grouping CPUs into pods. By
    default, the cache boundaries according to cpus_share_cache() define how
    the CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs
    and the system has two L3 caches. The workqueue would be mapped to two
    worker_pools, each serving one L3 cache domain.
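
    As an illustration, the default grouping can be sketched as below. This is
    not the actual pod construction code from the patch; assign_cpu_pods() and
    the cpu_pod[] array are made up for the example, while cpus_share_cache()
    is the real topology helper referred to above.

      /*
       * Sketch only: partition online CPUs into pods so that CPUs which share
       * a last-level cache end up in the same pod.  cpu_pod[] maps each CPU
       * to its pod index; the return value is the number of pods.
       */
      static int assign_cpu_pods(int *cpu_pod)
      {
              int cpu, peer, nr_pods = 0;

              for_each_online_cpu(cpu)
                      cpu_pod[cpu] = -1;

              for_each_online_cpu(cpu) {
                      if (cpu_pod[cpu] >= 0)
                              continue;       /* already grouped */
                      cpu_pod[cpu] = nr_pods;
                      for_each_online_cpu(peer)
                              if (cpu_pod[peer] < 0 && cpus_share_cache(cpu, peer))
                                      cpu_pod[peer] = nr_pods;
                      nr_pods++;
              }
              return nr_pods;
      }

    With two L3 caches, this yields two pods and hence the two worker_pools in
    the example above.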
    
    While this improves locality, because the pod boundaries are strict, it
    limits the total bandwidth a given issuer can consume. For example, let's
    say there is a thread pinned to a CPU issuing enough work items to saturate
    the whole machine. With the machine segmented into two pods, no matter how
    many work items it issues, it can only use half of the CPUs on the system.
    
    While this limitation has existed for a very long time, it wasn't very
    pronounced because the affinity grouping always used to be by NUMA node.
    With cache boundaries as the default and support for even finer grained
    scopes (smt and cpu), it is now a much more pressing problem.
    
    This patch implements non-strict affinity scope where the pod boundaries
    aren't enforced strictly. Going back to the previous example, the workqueue
    would still be mapped to two worker_pools; however, the affinity enforcement
    would be soft. The workers in both pools would have their cpus_allowed set
    to the whole machine thus allowing the scheduler to migrate them anywhere on
    the machine. However, whenever an idle worker is woken up, the workqueue
    code asks the scheduler to bring the task back within the pod if the worker
    is outside it, i.e. work items start executing within their affinity scope but can
    be migrated outside as the scheduler sees fit. This removes the hard cap on
    utilization while maintaining the benefits of affinity scopes.
    
    After the earlier ->__pod_cpumask changes, the implementation is pretty
    simple (a simplified sketch follows the list below). When non-strict, which
    is the new default:
    
    * pool_allowed_cpus() returns @pool->attrs->cpumask instead of
      ->__pod_cpumask so that the workers are allowed to run on any CPU that
      the associated workqueues allow.
    
    * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
      the field to a CPU within the pod.
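
    A simplified sketch of the two points above (not verbatim from the patch;
    first_idle_worker() is assumed to return an idle worker in the pool, and
    locking, CONFIG_SMP guards and the rest of the wakeup path are omitted):

      static struct cpumask *pool_allowed_cpus(struct worker_pool *pool)
      {
              /*
               * Strict scope (which per the v2 note below includes all per-cpu
               * pools): confine workers to the pod.  Non-strict: let workers
               * run anywhere the associated workqueues allow.
               */
              if (pool->attrs->affn_strict)
                      return pool->attrs->__pod_cpumask;
              return pool->attrs->cpumask;
      }

      static void kick_pool(struct worker_pool *pool)
      {
              struct worker *worker = first_idle_worker(pool);

              if (!worker)
                      return;

              /*
               * Non-strict: if the idle worker last ran outside the pod, ask
               * the scheduler to wake it on a CPU inside the pod so that the
               * work item starts executing within the affinity scope.
               */
              if (!pool->attrs->affn_strict &&
                  !cpumask_test_cpu(worker->task->wake_cpu,
                                    pool->attrs->__pod_cpumask))
                      worker->task->wake_cpu =
                              cpumask_any_distribute(pool->attrs->__pod_cpumask);

              wake_up_process(worker->task);
      }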
    
    This would be the first use of task_struct->wake_cpu outside scheduler
    proper, so it isn't clear whether this would be acceptable. However, other
    methods of migrating tasks are significantly more expensive and are likely
    prohibitively so if we want to do this on every work item. This needs
    discussion with scheduler folks.
    
    There is also a race window where setting ->wake_cpu wouldn't be effective
    as the target task is still on CPU. However, the window is pretty small and
    this being a best-effort optimization, it doesn't seem to warrant more
    complexity at the moment.
    
    While non-strict cache affinity scopes seem to be the best option, the
    performance picture depends on the affinity scope and is too complicated to
    discuss fully in this patch. The behavior is therefore made easily
    selectable through wqattrs and sysfs, and the next patch will add
    documentation discussing the performance implications.
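
    As an example of the kernel-side selection, a built-in caller could opt a
    workqueue back into strict affinity roughly as below. This is illustrative
    only: affn_strict is the attribute introduced here, the other interfaces
    already exist but their availability to a particular caller depends on
    kernel version and configuration, and the sysfs knob exposes the same
    switch to userspace.

      #include <linux/init.h>
      #include <linux/workqueue.h>

      /* Hypothetical example workqueue, not part of this patch. */
      static struct workqueue_struct *example_wq;

      static int __init example_wq_init(void)
      {
              struct workqueue_attrs *attrs;
              int ret;

              example_wq = alloc_workqueue("example_wq", WQ_UNBOUND, 0);
              if (!example_wq)
                      return -ENOMEM;

              attrs = alloc_workqueue_attrs();
              if (!attrs) {
                      destroy_workqueue(example_wq);
                      return -ENOMEM;
              }

              attrs->affn_strict = true;      /* enforce pod boundaries strictly */
              ret = apply_workqueue_attrs(example_wq, attrs);
              free_workqueue_attrs(attrs);
              return ret;
      }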
    
    v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>