• Andrew Morton's avatar
    [PATCH] filtered wakeups · 2f242854
    Andrew Morton authored
    From: William Lee Irwin III <wli@holomorphy.com>
    
    This patch series is solving the "thundering herd" problem that occurs in the
    mainline implementation of hashed waitqueues.  There are two sources of
    spurious wakeups in such arrangements:
    
    (a) Hash collisions that place waiters on different objects on the same
        waitqueue, which wakes threads falsely when any of the objects hashed to
        the same queue receives a wakeup.  i.e.  loss of information about which
        object a wakeup event is related to.
    
    (b) Loss of information about which object a given waiter is waiting on.
        This precludes wake-one semantics for mutual exclusion scenarios.  For
        instance, a lock bit may be slept on.  If there are any waiters on the
        object, a lock bit release event must wake at least one of them so as to
        prevent deadlock.  But without information as to which waiter is waiting
        on which object, we must resort to waking all waiters who could possibly
        be waiting on it.  Now, as the lock bit provides mutual exclusion, only
        one of the waiters woken can proceed, and the remainder will go back to
        sleep and wait for another event, creating unnecessary system load.  Once
        wake-one semantics are established, only one of the waiters waiting to
        acquire a lock bit need to be woken, which measurably reduces system load
        and improves efficiency (i.e.  it's the subject of the benchmarking I've
        been sending to you).
    
    Even beyond the measurable efficiency gains, there are reasons of robustness
    and responsiveness to motivate addressing the issue of thundering herds.  In a
    real-life scenario I've been personally involved in resolving, the thundering
    herd issue caused powerful modern SMP machines with fast IO systems to be
    unresponsive to user input for a minute at a time or more.  Analogues of these
    patches for the distro kernels involved fully resolved the issue to the
    customer's satisfaction and obviated workarounds to limit the pagecache's
    size.
    
    The latest spin of these patches basically shoves more pieces of the logic
    into the wakeup functions, with some efficiency gains from sharing the hot
    codepath with the rest of the kernel, and a slightly larger diff than the
    patches with the newly-introduced entrypoint.  Writing these was motivated by
    the push to insulate sched.c from more of the details of wakeup semantics by
    putting more of the logic into the wakeup functions.  In order to accomplish
    this while still solving (b), the wakeup functions grew a new argument for
    communication about what object a wakeup event is related to to be passed by
    the waker.
    
    =========
    
    This patch provides an additional argument to wakeup functions so that
    information may be passed from the waker to the waiter.  This is provided as a
    separate patch so that the overhead of the additional argument can be measured
    in isolation.  No change in performance was observable here.
    2f242854
fork.c 31.5 KB