29 May, 2012 (40 commits)
    • tmpfs: support SEEK_DATA and SEEK_HOLE · 4fb5ef08
      Hugh Dickins authored
      It's quite easy for tmpfs to scan the radix_tree to support llseek's new
      SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still
      on my mind (in particular, the !PageUptodate-ness of pages fallocated but
      still unwritten).
      
      But I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
      would be of any use to them on tmpfs.  This code adds 92 lines and 752
      bytes on x86_64 - is that bloat or worthwhile?
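
      For anyone wondering what the interface looks like from userspace, here
      is a minimal sketch (the tmpfs path and sizes are illustrative):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            off_t data, hole;
            int fd = open("/dev/shm/seektest", O_CREAT | O_TRUNC | O_RDWR, 0600);

            if (fd < 0) { perror("open"); return 1; }
            ftruncate(fd, 1 << 20);             /* 1MiB file, all hole */
            pwrite(fd, "x", 1, 512 * 1024);     /* one byte of data in the middle */

            data = lseek(fd, 0, SEEK_DATA);     /* first data at or after offset 0 */
            hole = lseek(fd, data, SEEK_HOLE);  /* next hole at or after that data */
            printf("data at %lld, hole at %lld\n", (long long)data, (long long)hole);

            close(fd);
            unlink("/dev/shm/seektest");
            return 0;
        }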
      
      [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Josef Bacik <josef@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Jeff liu <jeff.liu@oracle.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: quit when fallocate fills memory · 1aac1400
      Hugh Dickins authored
      As it stands, a large fallocate() on tmpfs is liable to fill memory with
      pages, freed on failure except when they run into swap, at which point
      they become fixed into the file despite the failure.  That feels quite
      wrong, to be consuming resources precisely when they're in short supply.
      
      Go the other way instead: shmem_fallocate() indicates the range it has
      fallocated to shmem_writepage(), keeping count of pages it's allocating;
      shmem_writepage() reactivates instead of swapping out pages fallocated by
      this syscall (but happily swaps out those from earlier occasions), keeping
      count; shmem_fallocate() compares counts and gives up once the reactivated
      pages have started coming back to writepage (approximately: some zones
      would in fact recycle faster than others).
      
      This is a little unusual, but works well: although we could consider the
      failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
      added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.
      
      (If there's no swap, an over-large fallocate() on tmpfs is limited in the
      same way as writing: stopped by rlimit, or by tmpfs mount size if that was
      set sensibly, or by __vm_enough_memory() heuristics if OVERCOMMIT_GUESS or
      OVERCOMMIT_NEVER.  If OVERCOMMIT_ALWAYS, then it is liable to OOM-kill
      others as writing would, but stops and frees if interrupted.)
      
      Now that everything is freed on failure, we can skip updating ctime.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: undo fallocation on failure · 1635f6a7
      Hugh Dickins authored
      In the previous episode, we left the already-fallocated pages attached to
      the file when shmem_fallocate() fails part way through.
      
      Now try to do better, by extending the earlier optimization of !Uptodate
      pages (then always under page lock) to !Uptodate pages (outside of page
      lock), representing fallocated pages.  And don't waste time clearing them
      at the time of fallocate(), leave that until later if necessary.
      
      Adapt shmem_truncate_range() to shmem_undo_range(), so that a failing
      fallocate can recognize and remove precisely those !Uptodate allocations
      which it added (and were not independently allocated by racing tasks).
      
      But unless we start playing with swapfile.c and memcontrol.c too, once one
      of our fallocated pages reaches shmem_writepage(), we do then have to
      instantiate it as an ordinarily allocated page, before swapping out.  This
      is unsatisfactory, but improved in the next episode.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: support fallocate preallocation · e2d12e22
      Hugh Dickins authored
      The systemd plumbers expressed a wish that tmpfs support preallocation.
      Cong Wang wrote a patch, but several kernel guys expressed scepticism:
      https://lkml.org/lkml/2011/11/18/137
      
      Christoph Hellwig: What for exactly? Please explain why preallocating on
      tmpfs would make any sense.
      
      Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files
      on the /dev/shm filesystem.  The glibc fallback loop for -ENOSYS [or
      -EOPNOTSUPP] on fallocate is just ugly.
      
      Hugh Dickins: If tmpfs is going to support
      fallocate(FALLOC_FL_PUNCH_HOLE), it would seem perverse to permit the
      deallocation but fail the allocation.  Christoph Hellwig: Agreed.
      
      Now that we do have shmem_fallocate() for hole-punching, plumb in basic
      support for preallocation mode too.  It's fairly straightforward (though
      quite a few details needed attention), except for when it fails part way
      through.  What a pity that fallocate(2) was not specified to return the
      length allocated, permitting short fallocations!
      
      As it is, when it fails part way through, we ought to free what has just
      been allocated by this system call; but must be very sure not to free any
      allocated earlier, or any allocated by racing accesses (not all excluded
      by i_mutex).
      
      But we cannot distinguish them: so in this patch simply leak allocations
      on partial failure (they will be freed later if the file is removed).
      
      An attractive alternative approach would have been for fallocate() not to
      allocate pages at all, but to note reservations by entries in the radix-tree.
      But that would give less assurance, and, critically, would be hard to fit
      with mem cgroups (who owns the reservations?): allocating pages lets
      fallocate() behave in just the same way as write().
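
      As a sketch of the user-visible result, with the glibc-style fallback
      Kay mentions (the file path is illustrative):

        #define _GNU_SOURCE
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/dev/shm/prealloc", O_CREAT | O_TRUNC | O_RDWR, 0600);

            if (fd < 0) { perror("open"); return 1; }
            if (fallocate(fd, 0, 0, 16 << 20) == 0)
                puts("16MiB preallocated: mmap() writes cannot hit SIGBUS for lack of space");
            else if (errno == EOPNOTSUPP || errno == ENOSYS)
                puts("fallocate unsupported: glibc would fall back to a write loop");
            else
                perror("fallocate");
            close(fd);
            unlink("/dev/shm/prealloc");
            return 0;
        }
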
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fs: remove truncate_range · 17cf28af
      Hugh Dickins authored
      Remove vmtruncate_range(), and remove the truncate_range method from
      struct inode_operations: only tmpfs ever supported it, and tmpfs has now
      converted over to using the fallocate method of file_operations.
      
      Update Documentation accordingly, adding (setlease and) fallocate lines.
      And while we're in mm.h, remove duplicate declarations of shmem_lock() and
      shmem_file_setup(): everyone is now using the ones in shmem_fs.h.
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE · 3f31d075
      Hugh Dickins authored
      Now that tmpfs supports hole-punching via fallocate(), switch
      madvise_remove() to use do_fallocate() instead of vmtruncate_range(),
      which extends madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2
      and xfs.
      
      There is one more user of vmtruncate_range() in our tree,
      staging/android's ashmem_shrink(): convert it to use do_fallocate() too
      (but if its unpinned areas are already unmapped - I don't know - then it
      would do better to use shmem_truncate_range() directly).
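
      To make the routing concrete, a sketch showing the old madvise()
      spelling next to the fallocate() call it now becomes (path illustrative,
      error handling abbreviated):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/falloc.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            long pg = sysconf(_SC_PAGESIZE);
            int fd = open("/dev/shm/punch", O_CREAT | O_TRUNC | O_RDWR, 0600);
            char *map;

            if (fd < 0) { perror("open"); return 1; }
            ftruncate(fd, 16 * pg);
            map = mmap(NULL, 16 * pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED) { perror("mmap"); return 1; }

            /* old spelling: page-aligned range via the mapping */
            if (madvise(map + 4 * pg, 4 * pg, MADV_REMOVE))
                perror("madvise");

            /* new spelling: hole-punch on the file itself */
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          8 * pg, 4 * pg))
                perror("fallocate");

            munmap(map, 16 * pg);
            close(fd);
            unlink("/dev/shm/punch");
            return 0;
        }
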
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@android.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE · 83e4fa9c
      Hugh Dickins authored
      tmpfs has supported hole-punching since 2.6.16, via
      madvise(,,MADV_REMOVE).
      
      But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is
      the agreed way to punch holes.
      
      So add shmem_fallocate() to support that, and tweak shmem_truncate_range()
      to support partial pages at both the beginning and end of range (never
      needed for madvise, which demands rounded addr and rounds up length).
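
      A sketch of the partial-page handling this buys (path and offsets
      illustrative): punch an unaligned hole and read back zeroes:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/falloc.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            char buf[8192];
            int fd = open("/dev/shm/hole", O_CREAT | O_TRUNC | O_RDWR, 0600);

            if (fd < 0) { perror("open"); return 1; }
            memset(buf, 0x5a, sizeof(buf));
            pwrite(fd, buf, sizeof(buf), 0);

            /* both ends land mid-page, so shmem_truncate_range() must zero
               the partial pages rather than drop whole ones */
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          1000, 5000))
                perror("fallocate");

            pread(fd, buf, 1, 1000);
            printf("byte at offset 1000 is now %#x\n", buf[0]);
            close(fd);
            unlink("/dev/shm/hole");
            return 0;
        }
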
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: optimize clearing when writing · ec9516fb
      Hugh Dickins authored
      Nick proposed years ago that tmpfs should avoid clearing its pages where
      write will overwrite them with new data, as ramfs has long done.  But I
      messed it up and just got bad data.  Tried again recently, and it works
      fine.
      
      Here's time output for writing 4GiB 16 times on this Core i5 laptop:
      
      before: real	0m21.169s user	0m0.028s sys	0m21.057s
              real	0m21.382s user	0m0.016s sys	0m21.289s
              real	0m21.311s user	0m0.020s sys	0m21.217s
      
      after:  real	0m18.273s user	0m0.032s sys	0m18.165s
              real	0m18.354s user	0m0.020s sys	0m18.265s
              real	0m18.440s user	0m0.032s sys	0m18.337s
      
      ramfs:  real	0m16.860s user	0m0.028s sys	0m16.765s
              real	0m17.382s user	0m0.040s sys	0m17.273s
              real	0m17.133s user	0m0.044s sys	0m17.021s
      
      Yes, I have done perf reports, but they need more explanation than they
      deserve: in summary, clear_page vanishes, its cache loading shifts into
      copy_user_generic_unrolled; shmem_getpage_gfp goes down, and
      surprisingly mark_page_accessed goes way up - I think because they are
      respectively where the cache gets to be reloaded after being purged by
      clear or copy.
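
      The numbers are time(1) over plain writes; in the same spirit, a minimal
      harness (path, chunk size and fill pattern illustrative, not the actual
      test script):

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            static char buf[1 << 20];                /* 1MiB chunks */
            unsigned long long done, total = 4ULL << 30;  /* 4GiB in all */
            int fd = open("/dev/shm/bench", O_CREAT | O_TRUNC | O_WRONLY, 0600);
            struct timespec t0, t1;

            if (fd < 0) { perror("open"); return 1; }
            memset(buf, 0x5a, sizeof(buf));
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (done = 0; done < total; done += sizeof(buf))
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                    perror("write"); return 1;
                }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("%.3fs\n", (t1.tv_sec - t0.tv_sec) +
                              (t1.tv_nsec - t0.tv_nsec) / 1e9);
            close(fd);
            unlink("/dev/shm/bench");
            return 0;
        }
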
      Suggested-by: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: enable NOSEC optimization · 2f6e38f3
      Hugh Dickins authored
      Let tmpfs into the NOSEC optimization (avoiding file_remove_suid()
      overhead on most common writes): set MS_NOSEC on its superblocks.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: replace page if mapping excludes its zone · bde05d1c
      Hugh Dickins authored
      The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the
      backing RAM has to be below 4GB.  Not a problem while the boards
      supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
      their GMA3600 is managed by the GMA500 driver.
      
      shmem/tmpfs has never pretended to support hardware restrictions on the
      backing memory, but it might have appeared to do so before v3.1, and
      even now it works fine until a page is swapped out then back in.  When
      read_cache_page_gfp() supplied a freshly allocated page for copy, that
      compensated for whatever choice might have been made by earlier swapin
      readahead; but swapoff was likely to destroy the illusion.
      
      We'd like to continue to support GMA500, so now add a new
      shmem_should_replace_page() check on the zone when about to move a page
      from swapcache to filecache (in swapin and swapoff cases), with
      shmem_replace_page() to allocate and substitute a suitable page (given
      gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
      
      This does involve a minor extension to mem_cgroup_replace_page_cache()
      (the page may or may not have already been charged); and I've removed a
      comment and call to mem_cgroup_uncharge_cache_page(), which in fact is
      always a no-op while PageSwapCache.
      
      Also removed optimization of an unlikely path in shmem_getpage_gfp(),
      now that we need to check PageSwapCache more carefully (a racing caller
      might already have made the copy).  And at one point shmem_unuse_inode()
      needs to use the hitherto private page_swapcount(), to guard against
      racing with inode eviction.
      
      It would make sense to extend shmem_should_replace_page(), to cover
      cpuset and NUMA mempolicy restrictions too, but set that aside for now:
      needs a cleanup of shmem mempolicy handling, and more testing, and ought
      to handle swap faults in do_swap_page() as well as shmem.
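
      As a sketch of what the new check presumably reduces to (the exact test
      isn't spelled out above, so take this as illustrative kernel-style code,
      not the verbatim patch):

        static bool shmem_should_replace_page(struct page *page, gfp_t gfp)
        {
            /* a page sitting in a zone above what the mapping's gfp mask
               allows (e.g. above DMA32 for GFP_KERNEL | __GFP_DMA32) must
               be copied into a suitable page and replaced */
            return page_zonenum(page) > gfp_zone(gfp);
        }
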
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Stephane Marchesin <marcheu@chromium.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks · 5ceb9ce6
      Bartlomiej Zolnierkiewicz authored
      When MIGRATE_UNMOVABLE pages are freed from a MIGRATE_UNMOVABLE
      pageblock (and some MIGRATE_MOVABLE pages are left in it), waiting until
      an allocation takes ownership of the block may take too long.  The type
      of the pageblock remains unchanged, so the pageblock cannot be used as a
      migration target during compaction.
      
      Fix it by:
      
      * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE] and
        COMPACT_SYNC; see the sketch after this list) and then converting the
        sync field in struct compact_control to use it.
      
      * Adding an nr_pageblocks_skipped field to struct compact_control and
        tracking how many destination pageblocks were of MIGRATE_UNMOVABLE
        type.  If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
        try_to_compact_pages() (COMPACT_COMPLETE), it implies that there is no
        suitable page for allocation.  In that case, check whether enough
        MIGRATE_UNMOVABLE pageblocks were encountered to justify a second pass
        in COMPACT_ASYNC_UNMOVABLE mode.
      
      * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
        COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
        finding PageBuddy pages, pages with page_count(page) == 0, or PageLRU
        pages.  If all pages within the MIGRATE_UNMOVABLE pageblock are in one
        of those three sets, change the whole pageblock type to
        MIGRATE_MOVABLE.
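
      The enum from the first point, sketched from the names given above:

        enum compact_mode {
            COMPACT_ASYNC_MOVABLE,
            COMPACT_ASYNC_UNMOVABLE,
            COMPACT_SYNC,
        };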
      
      My particular test case (on an ARM EXYNOS4 device with 512 MiB, which
      means 131072 standard 4KiB pages in the 'Normal' zone) is to:
      
      - allocate 120000 pages for kernel's usage
      - free every second page (60000 pages) of memory just allocated
      - allocate and use 60000 pages from user space
      - free remaining 60000 pages of kernel memory
        (now we have fragmented memory occupied mostly by user space pages)
      - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage
      
      The results:
      - with compaction disabled I get 11 successful allocations
      - with compaction enabled - 14 successful allocations
      - with this patch I'm able to get all 100 successful allocations
      
      NOTE: If we can make kswapd aware of order-0 request during compaction, we
      can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
      (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE).  Please see the
      following thread:
      
      	http://marc.info/?l=linux-mm&m=133552069417068&w=2
      
      [minchan@kernel.org: minor cleanups]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove sparsemem allocation details from the bootmem allocator · 238305bb
      Johannes Weiner authored
      alloc_bootmem_section() derives allocation area constraints from the
      specified sparsemem section.  This is a bit specific for a generic memory
      allocator like bootmem, though, so move it over to sparsemem.
      
      As __alloc_bootmem_node_nopanic() already retries failed allocations with
      relaxed area constraints, the fallback code in sparsemem.c can be removed
      and the code becomes a bit more compact overall.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: pass pgdat instead of pgdat->bdata down the stack · e9079911
      Johannes Weiner authored
      Pass the node descriptor down the call stack, instead of the more
      specific bootmem node descriptor, like nobootmem does: there is no good
      reason for the two to be different.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: nobootmem: unify allocation policy of (non-)panicking node allocations · ba539868
      Johannes Weiner authored
      While the panicking node-specific allocation function tries to satisfy
      node+goal, goal, node, anywhere, the non-panicking function still does
      node+goal, goal, anywhere.
      
      Make it simpler: define the panicking version in terms of the non-panicking
      one, like the node-agnostic interface, so they always behave the same way
      apart from how to deal with allocation failure.
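
      The shape of that unification, as a hedged sketch (helper names as used
      elsewhere in this series, not the verbatim patch):

        void * __init __alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
                                           unsigned long align, unsigned long goal)
        {
            /* the non-panicking variant tries node+goal, goal, node, anywhere */
            void *ptr = __alloc_bootmem_node_nopanic(pgdat, size, align, goal);

            if (ptr)
                return ptr;

            printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
            panic("Out of memory");
        }
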
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: nobootmem: panic on node-specific allocation failure · 2c478eae
      Johannes Weiner authored
      __alloc_bootmem_node and __alloc_bootmem_low_node documentation claims
      the functions panic on allocation failure.  Do it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: unify allocation policy of (non-)panicking node allocations · 421456ed
      Johannes Weiner authored
      While the panicking node-specific allocation function tries to satisfy
      node+goal, goal, node, anywhere, the non-panicking function still does
      node+goal, goal, anywhere.
      
      Make it simpler: define the panicking version in terms of the
      non-panicking one, like the node-agnostic interface, so they always behave
      the same way apart from how to deal with allocation failure.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: allocate in order node+goal, goal, node, anywhere · ab381843
      Johannes Weiner authored
      Match the nobootmem version of __alloc_bootmem_node.  Try to satisfy both
      the node and the goal, then just the goal, then just the node, then
      allocate anywhere before panicking.
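
      As a sketch of that order (alloc_bootmem_bdata() is the rename from the
      patch below; the retry structure here is illustrative, not the verbatim
      patch):

        again:
            /* 1) node + goal */
            ptr = alloc_bootmem_bdata(pgdat->bdata, size, align, goal, limit);
            if (ptr)
                return ptr;
            /* 2) goal, any node */
            ptr = alloc_bootmem_core(size, align, goal, limit);
            if (ptr)
                return ptr;
            /* 3) and 4) node alone, then anywhere: retry with goal dropped */
            if (goal) {
                goal = 0;
                goto again;
            }
            panic("Out of memory");
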
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: split out goal-to-node mapping from goal dropping · c12ab504
      Johannes Weiner authored
      Matching the desired goal to the right node is one thing; dropping the
      goal when it cannot be satisfied is another.  Split these into separate
      functions so that subsequent patches can use the node-finding but drop
      and handle the goal fallback on their own terms.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: rename alloc_bootmem_core to alloc_bootmem_bdata · c6785b6b
      Johannes Weiner authored
      Callsites need to provide a bootmem_data_t *, so make the naming more
      descriptive.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: remove redundant offset check when finally freeing bootmem · 549381e1
      Johannes Weiner authored
      When bootmem releases an unaligned chunk of BITS_PER_LONG pages to the
      page allocator, it checks the bitmap for whether there are still
      unreserved pages in the chunk (set bits), but also whether the offset in
      the chunk indicates BITS_PER_LONG loop iterations already.
      
      But since the consulted bitmap is only a one-word excerpt of the full
      per-node bitmap, there cannot be more than BITS_PER_LONG bits set in
      it.  The additional offset check is unnecessary.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: fix checking the bitmap when finally freeing bootmem · 6dccdcbe
      Gavin Shan authored
      When bootmem releases an unaligned chunk of memory at the beginning of a
      node to the page allocator, it iterates from that unaligned PFN but
      checks an aligned word of the page bitmap.  The checked bits do not
      correspond to the PFNs and, as a result, reserved pages can be freed.
      
      Properly shift the bitmap word so that the lowest bit corresponds to the
      starting PFN before entering the freeing loop.
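
      The fix, roughly (a sketch of the relevant lines in
      free_all_bootmem_core()):

        unsigned long vec = ~map[idx];

        /* line bit 0 of the word up with the first PFN being released */
        vec >>= start & (BITS_PER_LONG - 1);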
      
      This bug has been around since commit 41546c17 ("bootmem: clean up
      free_all_bootmem_core") (2.6.27) without known reports.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: remove pageblock_default_order() · 955c1cd7
      Andrew Morton authored
      This has always been broken: one version takes an unsigned int and the
      other version takes no arguments.  This bug was hidden because one
      version of set_pageblock_order() was a macro which doesn't evaluate its
      argument.
      
      Simplify it all and remove pageblock_default_order() altogether.
      Reported-by: rajman mekaco <rajman.mekaco@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move is_vma_temporary_stack() declaration to huge_mm.h · 20995974
      Alex Shi authored
      When transparent_hugepage_enabled() is used outside mm/, such as in
      arch/x86/xx/tlb.c:
      
      +       if (!cpu_has_invlpg || vma->vm_flags & VM_HUGETLB
      +                       || transparent_hugepage_enabled(vma)) {
      +               flush_tlb_mm(vma->vm_mm);
      
      is_vma_temporary_stack() isn't declared in huge_mm.h, so this triggers
      compile errors:
      
        arch/x86/mm/tlb.c: In function `flush_tlb_range':
        arch/x86/mm/tlb.c:324:4: error: implicit declaration of function `is_vma_temporary_stack' [-Werror=implicit-function-declaration]
      
      Since is_vma_temporary_stack() is just used in rmap.c and huge_memory.c,
      it is better to move its declaration to huge_mm.h from rmap.h to avoid
      such errors.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/vm/page-types.c: cleanups · e30d539b
      Ulrich Drepper authored
      Compiling page-types.c with a recent compiler produces many warnings,
      mostly related to signed/unsigned comparisons.  This patch cleans up most
      of them.
      
      One remaining warning is about an unused parameter.  The <compiler.h> file
      doesn't define a __unused macro (or the like) yet.  This can be addressed
      later.
      Signed-off-by: Ulrich Drepper <drepper@gmail.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kbuild: install kernel-page-flags.h · 9295b7a0
      Ulrich Drepper authored
      Programs using /proc/kpageflags need to know about the various flags.  The
      <linux/kernel-page-flags.h> header provides them, and the comments in the
      file indicate that it is supposed to be used by user-level code.  But the
      header is not installed.
      
      Install the header and mark the unstable flags as out-of-bounds.  The
      page-types tool is also adjusted to not duplicate the definitions.
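
      A sketch of the kind of consumer this enables once the header is
      installed (reads the flag word for one pfn; needs root):

        #include <fcntl.h>
        #include <linux/kernel-page-flags.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
            uint64_t flags, pfn = argc > 1 ? strtoull(argv[1], NULL, 0) : 0;
            int fd = open("/proc/kpageflags", O_RDONLY);

            if (fd < 0) { perror("open"); return 1; }
            if (pread(fd, &flags, 8, pfn * 8) != 8) { perror("pread"); return 1; }
            printf("pfn %llu: flags %#llx (KPF_LRU=%d)\n",
                   (unsigned long long)pfn, (unsigned long long)flags,
                   !!(flags & (1ULL << KPF_LRU)));
            close(fd);
            return 0;
        }
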
      Signed-off-by: Ulrich Drepper <drepper@gmail.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: print physical addresses consistently with other parts of kernel · a62e2f4f
      Bjorn Helgaas authored
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -Zone PFN ranges:
          +Zone ranges:
          -  DMA32    0x00000010 -> 0x00100000
          +  DMA32    [mem 0x00010000-0xffffffff]
          -  Normal   0x00100000 -> 0x01080000
          +  Normal   [mem 0x100000000-0x107fffffff]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swiotlb: print physical addresses consistently with other parts of kernel · 3af684c7
      Bjorn Helgaas authored
      Print swiotlb info in a style consistent with the %pR style used elsewhere
      in the kernel.  For example:
      
          -Placing 64MB software IO TLB between ffff88007a662000 - ffff88007e662000
          -software IO TLB at phys 0x7a662000 - 0x7e662000
          +software IO TLB [mem 0x7a662000-0x7e661fff] (64MB) mapped at [ffff88007a662000-ffff88007e661fff]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: print physical addresses consistently with other parts of kernel · 365811d6
      Bjorn Helgaas authored
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -found SMP MP-table at [ffff8800000fce90] fce90
          +found SMP MP-table at [mem 0x000fce90-0x000fce9f] mapped at [ffff8800000fce90]
          -initial memory mapped : 0 - 20000000
          +initial memory mapped: [mem 0x00000000-0x1fffffff]
          -Base memory trampoline at [ffff88000009c000] 9c000 size 8192
          +Base memory trampoline [mem 0x0009c000-0x0009dfff] mapped at [ffff88000009c000]
          -SRAT: Node 0 PXM 0 0-80000000
          +SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: print e820 physical addresses consistently with other parts of kernel · 91eb0f67
      Bjorn Helgaas authored
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -BIOS-provided physical RAM map:
          +e820: BIOS-provided physical RAM map:
          - BIOS-e820: 0000000000000100 - 000000000009e000 (usable)
          +BIOS-e820: [mem 0x0000000000000100-0x000000000009dfff] usable
          -Allocating PCI resources starting at 90000000 (gap: 90000000:6ed1c000)
          +e820: [mem 0x90000000-0xfed1bfff] available for PCI devices
          -reserve RAM buffer: 000000000009e000 - 000000000009ffff
          +e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bug: completely remove code generated by disabled VM_BUG_ON() · 02602a18
      Konstantin Khlebnikov authored
      Even if CONFIG_DEBUG_VM=n, gcc generates code for some VM_BUG_ON()
      invocations; for example, VM_BUG_ON(!PageCompound(page) || !PageHead(page))
      in do_huge_pmd_wp_page() generates 114 bytes of code.
      
      But they mostly disappear when I split this VM_BUG_ON into two:

        -VM_BUG_ON(!PageCompound(page) || !PageHead(page));
        +VM_BUG_ON(!PageCompound(page));
        +VM_BUG_ON(!PageHead(page));

      Weird... but anyway, after this patch the code disappears completely.
      
        add/remove: 0/0 grow/shrink: 7/97 up/down: 135/-1784 (-1649)
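
      With BUILD_BUG_ON_INVALID() from the next patch in this series, the
      disabled case can keep type-checking the expression while emitting no
      code; a sketch of the resulting definition:

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON(cond) BUG_ON(cond)
        #else
        #define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)
        #endif
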
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bug: introduce BUILD_BUG_ON_INVALID() macro · baf05aa9
      Konstantin Khlebnikov authored
      Sometimes we want to check some expressions correctness at compile time.
      "(void)(e);" or "if (e);" can be dangerous if the expression has
      side-effects, and gcc sometimes generates a lot of code, even if the
      expression has no effect.
      
      This patch introduces the macro BUILD_BUG_ON_INVALID() for such checks:
      it forces a compilation error if the expression is invalid, without
      emitting any extra code.
      
      [Cast to "long" required because sizeof does not work for bit-fields.]
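
      Given that note about the cast, the macro presumably boils down to a
      sizeof that is never evaluated at run time; a sketch:

        /* sizeof() is not evaluated, so side-effects vanish and no code is
           emitted; the cast to long lets it compile for bit-fields too */
        #define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((long)(e))))
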
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Cross Memory Attach: make it Kconfigurable · 5febcbe9
      Christopher Yeoh authored
      Add a Kconfig option to allow people who don't want cross memory attach to
      not have it included in their build.
      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation: memcg: future proof hierarchical statistics documentation · eb6332a5
      Johannes Weiner authored
      The hierarchical versions of per-memcg counters in memory.stat are all
      calculated the same way and are all named total_<counter>.
      
      Documenting the pattern is easier for maintenance than listing each
      counter twice.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Ying Han <yinghan@google.com>
      Randy Dunlap <rdunlap@xenotime.net>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: drop page_table_lock to uncharge memcg pages · 6f60b69d
      David Rientjes authored
      mm->page_table_lock is hotly contested for page fault tests and isn't
      necessary to do mem_cgroup_uncharge_page() in do_huge_pmd_wp_page().
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename is_mlocked_vma() to mlocked_vma_newpage() · 096a7cf4
      Ying Han authored
      Andrew pointed out that is_mlocked_vma() is misnamed.  A function with
      a name like that would be expected to return bool and have no
      side-effects.

      Since it is called on the fault path for a new page, rename it in this
      patch.
      Signed-off-by: Ying Han <yinghan@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      [akpm@linux-foundation.org: s/mlock_vma_newpage/mlocked_vma_newpage/, per Minchan]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: count pte references from every member of the reclaimed hierarchy · c3ac9a8a
      Johannes Weiner authored
      The rmap walker checking page table references has historically ignored
      references from VMAs that were not part of the memcg that was being
      reclaimed during memcg hard limit reclaim.
      
      When transitioning global reclaim to memcg hierarchy reclaim, I missed
      that bit and now references from outside a memcg are ignored even during
      global reclaim.
      
      Reverting to the traditional behaviour - counting all references during
      global reclaim and minding only references of the memcg being reclaimed
      during limit reclaim - would be one option.
      
      However, the more generic idea is to ignore references exactly then when
      they are outside the hierarchy that is currently under reclaim; because
      only then will their reclamation be of any use to help the pressure
      situation.  It makes no sense to ignore references from a sibling memcg
      and then evict a page that will be immediately refaulted by that sibling
      which contributes to the same usage of the common ancestor under
      reclaim.
      
      The solution: make the rmap walker ignore references from VMAs that are
      not part of the hierarchy that is being reclaimed.
      
      Flat limit reclaim will stay the same, hierarchical limit reclaim will
      mind the references only to pages that the hierarchy owns.  Global
      reclaim, since it reclaims from all memcgs, will be fixed to regard all
      references.
      
      [akpm@linux-foundation.org: name the args in the declaration]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite · 91c63734
      Johannes Weiner authored
      Library functions should not grab locks when the callsites can do it,
      even if the lock nests like the rcu read-side lock does.
      
      Push the rcu_read_lock() from css_is_ancestor() to its single user,
      mem_cgroup_same_or_subtree() in preparation for another user that may
      already hold the rcu read-side lock.
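
      The resulting shape at the callsite, sketched (reduced to the locking
      pattern alone):

        static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
                                               struct mem_cgroup *memcg)
        {
            bool ret;

            rcu_read_lock();    /* was previously taken inside css_is_ancestor() */
            ret = css_is_ancestor(&memcg->css, &root_memcg->css);
            rcu_read_unlock();
            return ret;
        }
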
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do_migrate_pages(): rename arguments · 0ce72d4f
      Andrew Morton authored
      s/from_nodes/from/ and s/to_nodes/to/.  The "_nodes" suffix is redundant -
      it duplicates the argument's type.
      
      Done in a fit of irritation over 80-col issues :(
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <mkosaki@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do_migrate_pages() calls migrate_to_node() even if task is already on a correct node · 4a5b18cc
      Larry Woodman authored
      While running an application that moves tasks from one cpuset to another
      I noticed that it takes much longer and moves many more pages than
      expected.
      
      The reason for this is do_migrate_pages() does its best to preserve the
      relative node differential from the first node of the cpuset because the
      application may have been written with that in mind.  If memory was
      interleaved on the nodes of the source cpuset by an application
      do_migrate_pages() will try its best to maintain that interleaving on
      the nodes of the destination cpuset.  This means copying the memory from
      all source nodes to the destination nodes even if the source and
      destination nodes overlap.
      
      This is a problem for userspace NUMA placement tools.  The amount of
      time spent doing extra memory moves cancels out some of the NUMA
      performance improvements.  Furthermore, if the numbers of source and
      destination nodes differ, there is no way to maintain the previous
      interleaving layout anyway.
      
      This patch changes do_migrate_pages() to only preserve the relative
      layout inside the program if the number of NUMA nodes in the source and
      destination mask are the same.  If the number is different, we do a much
      more efficient migration by not touching memory that is in an allowed
      node.
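
      Sketched in terms of the kernel's nodemask API (with s as the source
      node being considered in the migration loop):

        /* only when the node counts differ is it safe to leave pages
           alone that already sit in an allowed destination node */
        if ((nodes_weight(*from_nodes) != nodes_weight(*to_nodes)) &&
            node_isset(s, *to_nodes))
            continue;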
      
      This preserves the old behaviour for programs that want it, while
      allowing a userspace NUMA placement tool to use the new, faster
      migration.  This improves performance in our tests by up to a factor of
      7.
      
      Without this change, migrating tasks from a cpuset containing nodes 0-7
      to a cpuset containing nodes 3-4, we migrate from ALL the nodes even if
      they are in both the source and destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 4 to 3
         Migrating 1 to 4
         Migrating 3 to 4
         Migrating 0 to 3
         Migrating 2 to 3
      
      With this change we only migrate from nodes that are not in the
      destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 2 to 3
         Migrating 1 to 4
         Migrating 0 to 3
      
      Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
      containing 3,4,5 we still do move everything so that we preserve the
      desired NUMA offsets:
      
         Migrating 4 to 5
         Migrating 3 to 4
         Migrating 2 to 3
      
      As far as performance is concerned this simple patch improves the time
      it takes to move 14, 20 and 26 large tasks from a cpuset containing
      nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7.
      Here are the timings with and without the patch:
      
      BEFORE PATCH -- Move times: 59, 140, 651 seconds
      ============
      
        Moving 14 tasks from nodes (0-7) to nodes (1,3)
        numad(8780) do_migrate_pages (mm=0xffff88081d414400
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 8890 moved to node(s) 1,3 in 59.2 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
        numad(8780) do_migrate_pages (mm=0xffff88081d88c700
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 8962 moved to node(s) 1,4-5 in 139.88 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
        numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 9058 moved to node(s) 1-3,5 in 651.45 seconds
      
      AFTER PATCH -- Move times: 42, 56, 93 seconds
      ===========
      
        Moving 14 tasks from nodes (0-7) to nodes (5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 33221 moved to node(s) 5,7 in 41.67 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
        numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 33289 moved to node(s) 1,3,5 in 56.3 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d924400
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds
      
      [akpm@linux-foundation.org: clean up comment layout]
      Signed-off-by: Larry Woodman <lwoodman@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, memcg: split hugepage for memcg oom on cow · 1f1d06c3
      David Rientjes authored
      On COW, a new hugepage is allocated and charged to the memcg.  If the
      system is oom or the charge to the memcg fails, however, the fault
      handler will return VM_FAULT_OOM, which results in an oom kill.

      Instead, it's possible to fall back to splitting the hugepage, so that
      the COW results only in an order-0 page being allocated and charged to
      the memcg, which has a higher likelihood of succeeding.  This is
      expensive because the hugepage must be split in the page fault handler,
      but it is much better than unnecessarily oom killing a process.
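
      A heavily hedged sketch of the fallback (split_huge_page() is the real
      kernel API; the helpers and control flow here are hypothetical,
      illustrative only):

        new_page = alloc_cow_hugepage(vma, haddr);           /* hypothetical */
        if (!new_page || charge_to_memcg_fails(new_page)) {  /* hypothetical */
            split_huge_page(page);      /* demote the original to order-0 */
            /* let the COW be retried as ordinary small-page faults,
               instead of returning VM_FAULT_OOM */
        }
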
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>