    memcg: further prevent OOM with too many dirty pages · c3b94f44
    Hugh Dickins authored
    The may_enter_fs test turns out to be too restrictive: though I saw no
    problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
    on 3.5-rc6-mm1.  I don't know what the difference there is; perhaps I just
    slightly changed the way I started off the testing: running
    "dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync"
    repeatedly, in a cgroup with memory.limit_in_bytes set to 20M, writing
    to ext4 on a USB stick.
    
    ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
    AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
    the transaction needs to be started even before allocating pagecache
    memory.  But it may not be worth worrying about these days: if direct
    reclaim avoids FS writeback, does __GFP_FS now mean anything?
    
    Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
    device; but since that also masks off __GFP_IO, we can test for __GFP_IO
    directly, ignoring may_enter_fs and __GFP_FS.
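
    As an illustration only (a userspace model, not the vmscan.c change
    itself), the difference between the two tests can be sketched as
    below.  The MODEL_* bits and all function names are hypothetical and
    do not carry the kernel's real flag values; the old test is
    simplified to "__GFP_FS is set", which ignores the swap-cache case.

```c
#include <stdbool.h>

/* Toy flag bits for this model only; NOT the kernel's real values. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

/* Old test (simplified assumption): waiting allowed only with __GFP_FS. */
bool old_may_wait(unsigned int gfp_mask)
{
    return (gfp_mask & MODEL_GFP_FS) != 0;
}

/* New test: waiting on writeback needs only __GFP_IO. */
bool new_may_wait(unsigned int gfp_mask)
{
    return (gfp_mask & MODEL_GFP_IO) != 0;
}

/* Loop-device style caller: masks off both __GFP_IO and __GFP_FS. */
unsigned int loop_mask(unsigned int gfp_mask)
{
    return gfp_mask & ~(MODEL_GFP_IO | MODEL_GFP_FS);
}

/* AOP_FLAG_NOFS style caller: masks off __GFP_FS only. */
unsigned int nofs_mask(unsigned int gfp_mask)
{
    return gfp_mask & ~MODEL_GFP_FS;
}
```

    In this model a NOFS allocation is refused by the old test but
    allowed to wait by the new one, while a loop-style mask is refused
    by both, so the hang avoidance is kept.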
    
    But even so, the test still OOMs sometimes: when originally testing on
    3.5-rc6, it OOMed about one time in five or ten; when testing just now on
    3.5-rc6-mm1, it OOMed on the first iteration.
    
    This residual problem comes from an accumulation of pages under ordinary
    writeback, not marked PageReclaim, so rightly not causing the memcg check
    to wait on their writeback: these too can prevent shrink_page_list() from
    freeing any pages, so many times that memcg reclaim fails and OOMs.
    
    Deal with these in the same way as direct reclaim now deals with dirty FS
    pages: mark them PageReclaim.  It is appropriate to rotate these to tail
    of list when writepage completes, but more importantly, the PageReclaim
    flag makes memcg reclaim wait on them if encountered again.  Increment
    NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
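
    A minimal userspace sketch of the resulting decision for a page
    under writeback, assuming the behaviour described above.  The
    struct, enum and function names are hypothetical; the
    global_reclaim condition reflects my reading of the surrounding
    code rather than anything stated in this message.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the page flags and scan_control fields. */
struct model_page {
    bool writeback;   /* models PageWriteback */
    bool reclaim;     /* models PageReclaim */
};

struct model_scan {
    bool global_reclaim; /* true for global (non-memcg) reclaim */
    bool gfp_io;         /* true if __GFP_IO is set in the mask */
};

enum model_action {
    MODEL_PROCEED,     /* page is not under writeback */
    MODEL_KEEP_LOCKED, /* skip the page for now */
    MODEL_WAIT         /* wait for writeback to complete */
};

enum model_action model_writeback_step(struct model_page *page,
                                       const struct model_scan *sc)
{
    if (!page->writeback)
        return MODEL_PROCEED;

    if (sc->global_reclaim || !page->reclaim || !sc->gfp_io) {
        /*
         * First encounter (or waiting not allowed): mark the page so
         * that memcg reclaim waits on it if it is seen again.
         */
        page->reclaim = true;
        return MODEL_KEEP_LOCKED;
    }

    /* memcg reclaim, page already marked, and I/O is permitted. */
    return MODEL_WAIT;
}
```

    Calling model_writeback_step() twice on the same page from memcg
    reclaim with gfp_io set returns MODEL_KEEP_LOCKED the first time
    and MODEL_WAIT the second; without gfp_io it never waits.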
    
    Setting PageReclaim here may occasionally race with end_page_writeback()
    clearing it: lru_deactivate_fn() already faced the same race, and
    correctly concluded that the window is small and the issue non-critical.
    
    With these changes, the test runs indefinitely without OOMing on ext4,
    ext3 and ext2: I'll move on to test with other filesystems later.
    
    Trivia: invert conditions for a clearer block without an else, and goto
    keep_locked to do the unlock_page.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Ying Han <yinghan@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Fengguang Wu <fengguang.wu@intel.com>
    Acked-by: Michal Hocko <mhocko@suse.cz>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>