    Minor page waitqueue cleanups · 3510ca20
    Linus Torvalds authored
    Tim Chen and Kan Liang have been battling a customer load that shows
    extremely long page wakeup lists.  The cause seems to be constant NUMA
    migration of a hot page that is shared across a lot of threads, but the
    actual root cause for the exact behavior has not been found.
    
    Tim has a patch that batches the wait list traversal at wakeup time, so
    that we at least don't get long uninterruptible cases where we traverse
    and wake up thousands of processes and get nasty latency spikes.  That
    is likely 4.14 material, but we're still discussing the page waitqueue
    specific parts of it.
    
    In the meantime, I've tried to look at making the page wait queues less
    expensive, and failed miserably.  If you have thousands of threads
    waiting for the same page, it will be painful.  We'll need to try to
    figure out the NUMA balancing issue some day, in addition to avoiding
    the excessive spinlock hold times.
    
    That said, having tried to rewrite the page wait queues, I can at least
    fix up some of the braindamage in the current situation. In particular:
    
     (a) we don't want to continue walking the page wait list if the bit
         we're waiting for already got set again (which seems to be one of
         the patterns of the bad load).  That makes no progress and just
         causes pointless cache pollution chasing the pointers.
    
     (b) we don't want to put the non-locking waiters always on the front of
         the queue, and the locking waiters always on the back.  Not only is
         that unfair, it means that we wake up thousands of reading threads
         that will just end up being blocked by the writer later anyway.
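
The two points above can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `struct waiter`, `struct wait_list`, `wait_list_add_tail()` and `wake_waiters()` are hypothetical names invented for the example, and a woken exclusive waiter is assumed to take the bit immediately, which simplifies what really happens after a wakeup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel's wait queue types. */
struct waiter {
    int bit_nr;             /* flag bit this waiter sleeps on */
    bool exclusive;         /* true for a "locking" waiter */
    bool woken;
    struct waiter *next;
};

struct wait_list {
    struct waiter *head;
    struct waiter *tail;
};

/* (b) Every waiter is queued at the tail, so arrival order is kept and
 * non-locking waiters no longer jump ahead of locking ones. */
static void wait_list_add_tail(struct wait_list *q, struct waiter *w)
{
    assert(q != NULL && w != NULL);
    w->next = NULL;
    w->woken = false;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/* Wake waiters on bit_nr in queue order and return how many were woken. */
static int wake_waiters(struct wait_list *q, unsigned long *flags, int bit_nr)
{
    struct waiter **link = &q->head;
    struct waiter *prev = NULL;
    int woken = 0;

    while (*link) {
        struct waiter *w = *link;

        if (w->bit_nr != bit_nr) {      /* waiting on some other bit */
            prev = w;
            link = &w->next;
            continue;
        }

        /* (a) The bit got set again: walking further makes no progress
         * and only chases pointers, so stop here. */
        if (*flags & (1UL << bit_nr))
            break;

        /* Unlink the waiter and mark it woken. */
        *link = w->next;
        if (q->tail == w)
            q->tail = prev;
        w->next = NULL;
        w->woken = true;
        woken++;

        /* Assumption of this model: an exclusive waiter takes the bit
         * right away, so later waiters see it set. */
        if (w->exclusive)
            *flags |= 1UL << bit_nr;
    }
    return woken;
}
```

With a queue of reader, locker, reader (in arrival order) and the bit initially clear, this model wakes the first reader and the locker, then stops, leaving the second reader queued instead of waking it only to block again.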
    
    Also add a comment about the layout of 'struct wait_page_key' - there is
    an external user of it in the cachefiles code that means that it has to
    match the layout of 'struct wait_bit_key' in the first two members.  It
    so happens to match, because 'struct page *' and 'unsigned long *' end
    up having the same values simply because the page flags are the first
    member in struct page.
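
The layout constraint described above can be spelled out with compile-time checks. The following is a minimal sketch with simplified stand-in structs: the member lists here are assumptions made for illustration and are not copied from the kernel headers, and `keys_agree()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structs have more members. */
struct page {
    unsigned long flags;    /* assumed to be the first member */
    void *other;
};

struct wait_bit_key {
    void *flags;
    int bit_nr;
    unsigned long timeout;
};

struct wait_page_key {
    struct page *page;
    int bit_nr;
    int page_match;
};

/* The first two members must sit at the same offsets in both keys. */
_Static_assert(offsetof(struct wait_page_key, page) ==
               offsetof(struct wait_bit_key, flags),
               "first member offsets differ");
_Static_assert(offsetof(struct wait_page_key, bit_nr) ==
               offsetof(struct wait_bit_key, bit_nr),
               "second member offsets differ");

/* The pointer values only coincide because flags comes first in page. */
_Static_assert(offsetof(struct page, flags) == 0,
               "flags is not the first member of struct page");

/* Hypothetical helper: true when a page key and a bit key describe the
 * same wait, comparing only the shared two-member prefix. */
static bool keys_agree(const struct wait_page_key *pk,
                       const struct wait_bit_key *bk)
{
    return (const void *)pk->page == (const void *)bk->flags &&
           pk->bit_nr == bk->bit_nr;
}
```

If someone reordered the members of either key, or moved `flags` within `struct page`, one of the assertions in this sketch would fail at build time rather than the mismatch showing up as a silent missed wakeup.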
    
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Kan Liang <kan.liang@intel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Christopher Lameter <cl@linux.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
filemap.c 84.7 KB