• Johannes Weiner's avatar
    mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
    Johannes Weiner authored
    Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".
    
    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage.  We have seen similar cases at Facebook.
    
    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages.  In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met.  Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.
    
    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer.  If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.
    
    Patch #1 fixes the immediate problem Jia reported, the remainder are
    smaller fixlets, cleanups, and overall phasing out of the old method.
    
    Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.
    
    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.
    
    This patch (of 9):
    
    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:
    
    $ echo 4000 >/proc/sys/vm/nr_hugepages
    
    top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
    Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
    KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
       76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
    
    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.
    
    Kswapd needs to back away from nodes that fail to balance.  Up until
    commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism.  It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them.  This guard was erroneously
    removed as the patch changed the definition of a balanced node.
    
    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met.  Kswapd would stay awake anyway.
    
    Introduce a new and much simpler way of backing off.  If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node.  This is the same number of shots
    direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.
    
    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
      Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
      Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
    Reported-by: default avatarJia He <hejianet@gmail.com>
    Tested-by: default avatarJia He <hejianet@gmail.com>
    Acked-by: default avatarMichal Hocko <mhocko@suse.com>
    Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
    Acked-by: default avatarMinchan Kim <minchan@kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    c73322d0
vmscan.c 112 KB