1. 15 Dec, 2009 40 commits
    • KOSAKI Motohiro's avatar
      vmscan: zone_reclaim() don't use insane swap_cluster_max · 4f0ddfdf
      KOSAKI Motohiro authored
      In old days, we didn't have sc.nr_to_reclaim and it brought
      sc.swap_cluster_max misuse.
      
      huge sc.swap_cluster_max might makes unnecessary OOM risk and no
      performance benefit.
      
      Now, we can stop its insane thing.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f0ddfdf
    • KOSAKI Motohiro's avatar
      vmscan: kill hibernation specific reclaim logic and unify it · 7b51755c
      KOSAKI Motohiro authored
      shrink_all_zone() was introduced by commit d6277db4 (swsusp: rework
      memory shrinker) for hibernate performance improvement.  and
      sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
      memory for suspend).
      
      commit a06fe4d307 said
      
         Without the patch:
         Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
         Freed  88563 pages in 14719 jiffies = 23.50 MB/s
         Freed 205734 pages in 32389 jiffies = 24.81 MB/s
      
         With the patch:
         Freed  68252 pages in   496 jiffies = 537.52 MB/s
         Freed 116464 pages in   569 jiffies = 798.54 MB/s
         Freed 209699 pages in   705 jiffies = 1161.89 MB/s
      
      At that time, their patch was pretty worth.  However, Modern Hardware
      trend and recent VM improvement broke its worth.  From several reason, I
      think we should remove shrink_all_zones() at all.
      
      detail:
      
      1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
        at no i/o congestion.
        but current shrink_zone() is sane, not slow.
      
      2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
        fine on numa system.
        example)
          System has 4GB memory and each node have 2GB. and hibernate need 1GB.
      
          optimal)
             steal 500MB from each node.
          shrink_all_zones)
             steal 1GB from node-0.
      
        Oh, Cache balancing logic was broken. ;)
        Unfortunately, Desktop system moved ahead NUMA at nowadays.
        (Side note, if hibernate require 2GB, shrink_all_zones() never success
         on above machine)
      
      3) if the node has several I/O flighting pages, shrink_all_zones() makes
        pretty bad result.
      
        schenario) hibernate need 1GB
      
        1) shrink_all_zones() try to reclaim 1GB from Node-0
        2) but it only reclaimed 990MB
        3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
        4) it reclaimed 990MB
      
        Oh, well. it reclaimed twice much than required.
        In the other hand, current shrink_zone() has sane baling out logic.
        then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
      
      4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
        shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
      
      Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
      it bring good reviewability and debuggability, and solve above problems.
      
      side note: Reclaim logic unificication makes two good side effect.
       - Fix recursive reclaim bug on shrink_all_memory().
         it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
       - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b51755c
    • KOSAKI Motohiro's avatar
      vmscan: separate sc.swap_cluster_max and sc.nr_max_reclaim · 22fba335
      KOSAKI Motohiro authored
      Currently, sc.scap_cluster_max has double meanings.
      
       1) reclaim batch size as isolate_lru_pages()'s argument
       2) reclaim baling out thresolds
      
      The two meanings pretty unrelated. Thus, Let's separate it.
      this patch doesn't change any behavior.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      22fba335
    • Alex Chiang's avatar
      Documentation: ABI: /sys/devices/system/cpu/cpu#/node · cba5dd7f
      Alex Chiang authored
      Describe NUMA node symlink created for CPUs when CONFIG_NUMA is set.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cba5dd7f
    • Alex Chiang's avatar
      mm: add numa node symlink for cpu devices in sysfs · 1830794a
      Alex Chiang authored
      You can discover which CPUs belong to a NUMA node by examining
      /sys/devices/system/node/node#/
      
      However, it's not convenient to go in the other direction, when looking at
      /sys/devices/system/cpu/cpu#/
      
      Yes, you can muck about in sysfs, but adding these symlinks makes life a
      lot more convenient.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1830794a
    • Alex Chiang's avatar
      mm: refactor unregister_cpu_under_node() · b9d52dad
      Alex Chiang authored
      By returning early if the node is not online, we can unindent the
      interesting code by two levels.
      
      No functional change.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9d52dad
    • Alex Chiang's avatar
      mm: refactor register_cpu_under_node() · f8246f31
      Alex Chiang authored
      By returning early if the node is not online, we can unindent the
      interesting code by one level.
      
      No functional change.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f8246f31
    • Alex Chiang's avatar
      mm: add numa node symlink for memory section in sysfs · dee5d0d5
      Alex Chiang authored
      Commit c04fc586 (mm: show node to memory section relationship with
      symlinks in sysfs) created symlinks from nodes to memory sections, e.g.
      
      /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
      
      If you're examining the memory section though and are wondering what node
      it might belong to, you can find it by grovelling around in sysfs, but
      it's a little cumbersome.
      
      Add a reverse symlink for each memory section that points back to the
      node to which it belongs.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dee5d0d5
    • Hugh Dickins's avatar
      mm: sigbus instead of abusing oom · d99be1a8
      Hugh Dickins authored
      When do_nonlinear_fault() realizes that the page table must have been
      corrupted for it to have been called, it does print_bad_pte() and returns
      ...  VM_FAULT_OOM, which is hard to understand.
      
      It made some sense when I did it for 2.6.15, when do_page_fault() just
      killed the current process; but nowadays it lets the OOM killer decide who
      to kill - so page table corruption in one process would be liable to kill
      another.
      
      Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that
      the process will be killed, but is good enough for such a rare
      abnormality, accompanied as it is by the "BUG: Bad page map" message.
      
      And recent HWPOISON work has copied that code into do_swap_page(), when it
      finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d99be1a8
    • Hugh Dickins's avatar
      mm: stop ptlock enlarging struct page · a70caa8b
      Hugh Dickins authored
      CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t,
      and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
      enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.
      
      When 2.6.15 placed the split page table lock inside struct page (usually
      sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
      we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by
      letting the spinlock_t occupy both page->private and page->mapping).
      
      Should these debugging options be allowed to double the size of a struct
      page, when only one minority use of the page (as a page table) needs to
      fit a spinlock in there?  Perhaps not.
      
      Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or
      DEBUG_LOCK_ALLOC is in force.  I've sometimes tried to be cleverer,
      kmallocing a cacheline for the spinlock when it doesn't fit, but given up
      each time.  Falling back to mm->page_table_lock (as we do when ptlock is
      not split) lets lockdep check out the strictest path anyway.
      
      And now that some arches allow 8192 cpus, use 999999 for infinity.
      
      (What has this got to do with KSM swapping?  It doesn't care about the
      size of struct page, but may care about random junk in page->mapping - to
      be explained separately later.)
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a70caa8b
    • Hugh Dickins's avatar
      mm: pass address down to rmap ones · 1cb1729b
      Hugh Dickins authored
      KSM swapping will know where page_referenced_one() and try_to_unmap_one()
      should look.  It could hack page->index to get them to do what it wants,
      but it seems cleaner now to pass the address down to them.
      
      Make the same change to page_mkclean_one(), since it follows the same
      pattern; but there's no real need in its case.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1cb1729b
    • Hugh Dickins's avatar
      mm: CONFIG_MMU for PG_mlocked · af8e3354
      Hugh Dickins authored
      Remove three degrees of obfuscation, left over from when we had
      CONFIG_UNEVICTABLE_LRU.  MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
      CONFIG_HAVE_MLOCK is CONFIG_MMU.  rmap.o (and memory-failure.o) are only
      built when CONFIG_MMU, so don't need such conditions at all.
      
      Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
      169 defconfigs: leave those to evolve in due course.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af8e3354
    • Hugh Dickins's avatar
      mm: mlocking in try_to_unmap_one · 53f79acb
      Hugh Dickins authored
      There's contorted mlock/munlock handling in try_to_unmap_anon() and
      try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
      Simplify it by moving it all down into try_to_unmap_one().
      
      One thing is then lost, try_to_munlock()'s distinction between when no vma
      holds the page mlocked, and when a vma does mlock it, but we could not get
      mmap_sem to set the page flag.  But its only caller takes no interest in
      that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
      the code simple and return SWAP_AGAIN for both cases.
      
      try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
      amusing: once unravelled, it turns out to have been choosing between two
      different ways of doing the same nothing.  Ah, no, one way was actually
      returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
      
      [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
      [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53f79acb
    • Hugh Dickins's avatar
      mm: define PAGE_MAPPING_FLAGS · 3ca7b3c5
      Hugh Dickins authored
      At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
      in page->mapping, with the higher bits a pointer to the anon_vma; and have
      defined PageKsm(page) as that with NULL anon_vma.
      
      But KSM swapping will need to store a pointer there: so in preparation for
      that, now define PAGE_MAPPING_FLAGS as the low two bits, including
      PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
      other use for the bit emerges).
      
      Declare page_rmapping(page) to return the pointer part of page->mapping,
      and page_anon_vma(page) to return the anon_vma pointer when that's what it
      is.  Use these in a few appropriate places: notably, unuse_vma() has been
      testing page->mapping, but is better to be testing page_anon_vma() (cases
      may be added in which flag bits are set without any pointer).
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ca7b3c5
    • KOSAKI Motohiro's avatar
      vmscan: stop kswapd waiting on congestion when the min watermark is not being met · bb3ab596
      KOSAKI Motohiro authored
      If reclaim fails to make sufficient progress, the priority is raised.
      Once the priority is higher, kswapd starts waiting on congestion.
      However, if the zone is below the min watermark then kswapd needs to
      continue working without delay as there is a danger of an increased rate
      of GFP_ATOMIC allocation failure.
      
      This patch changes the conditions under which kswapd waits on congestion
      by only going to sleep if the min watermarks are being met.
      
      [mel@csn.ul.ie: add stats to track how relevant the logic is]
      [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb3ab596
    • Mel Gorman's avatar
      vmscan: have kswapd sleep for a short interval and double check it should be asleep · f50de2d3
      Mel Gorman authored
      After kswapd balances all zones in a pgdat, it goes to sleep.  In the
      event of no IO congestion, kswapd can go to sleep very shortly after the
      high watermark was reached.  If there are a constant stream of allocations
      from parallel processes, it can mean that kswapd went to sleep too quickly
      and the high watermark is not being maintained for sufficient length time.
      
      This patch makes kswapd go to sleep as a two-stage process.  It first
      tries to sleep for HZ/10.  If it is woken up by another process or the
      high watermark is no longer met, it's considered a premature sleep and
      kswapd continues work.  Otherwise it goes fully to sleep.
      
      This adds more counters to distinguish between fast and slow breaches of
      watermarks.  A "fast" premature sleep is one where the low watermark was
      hit in a very short time after kswapd going to sleep.  A "slow" premature
      sleep indicates that the high watermark was breached after a very short
      interval.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f50de2d3
    • Huang Shijie's avatar
      rmap: move label `out' to a better place · 273f047e
      Huang Shijie authored
      When the code jumps to the `out', `referenced' is still zero.  So there is
      no need to check it.
      Signed-off-by: default avatarHuang Shijie <shijie8@gmail.com>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      273f047e
    • Huang Shijie's avatar
      rmap: simplify try_to_unmap_file() · 7b511594
      Huang Shijie authored
      Just simplify the code when `mlocked' is true.
      Signed-off-by: default avatarHuang Shijie <shijie8@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b511594
    • Huang Shijie's avatar
      rmap: fix the comment for try_to_unmap_anon · 8051be5e
      Huang Shijie authored
      Fix the comment for try_to_unmap_anon() with the new arguments.
      Signed-off-by: default avatarHuang Shijie <shijie8@gmail.com>
      Acked-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8051be5e
    • Vincent Li's avatar
      mm/vmscan: change comment generic_file_write to __generic_file_aio_write · 6aceb53b
      Vincent Li authored
      Commit 543ade1f ("Streamline generic_file_* interfaces and filemap
      cleanups") removed generic_file_write() in filemap.  Change the comment in
      vmscan pageout() to __generic_file_aio_write().
      Signed-off-by: default avatarVincent Li <macli@brc.ubc.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6aceb53b
    • Lee Schermerhorn's avatar
      swap: rework map_swap_page() again · d4906e1a
      Lee Schermerhorn authored
      Seems that page_io.c doesn't really need to know that page_private(page)
      is the swp_entry 'val'.  Rework map_swap_page() to do what its name says
      and map a page to a page offset in the swap space.
      
      The only other caller of map_swap_page() is internal to mm/swapfile.c and
      it does want to map a swap entry to the 'sector'.  So rename
      map_swap_page() to map_swap_entry(), make it 'static' and and implement
      map_swap_page() as a wrapper around that.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d4906e1a
    • Hugh Dickins's avatar
      swap_info: reorder its fields · 7509765a
      Hugh Dickins authored
      Reorder (and comment) the fields of swap_info_struct, to make better
      use of its cachelines: it's good for swap_duplicate() in particular
      if unsigned int max and swap_map are near the start.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7509765a
    • Hugh Dickins's avatar
      swap_info: note SWAP_MAP_SHMEM · aaa46865
      Hugh Dickins authored
      While we're fiddling with the swap_map values, let's assign a particular
      value to shmem/tmpfs swap pages: their swap counts are never incremented,
      and it helps swapoff's try_to_unuse() a little if it can immediately
      distinguish those pages from process pages.
      
      Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
      we might as well use that 0xbf value for SWAP_MAP_SHMEM.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aaa46865
    • Hugh Dickins's avatar
      swap_info: swap count continuations · 570a335b
      Hugh Dickins authored
      Swap is duplicated (reference count incremented by one) whenever the same
      swap page is inserted into another mm (when forking finds a swap entry in
      place of a pte, or when reclaim unmaps a pte to insert the swap entry).
      
      swap_info_struct's vmalloc'ed swap_map is the array of these reference
      counts: but what happens when the unsigned short (or unsigned char since
      the preceding patch) is full? (and its high bit is kept for a cache flag)
      
      We then lose track of it, never freeing, leaving it in use until swapoff:
      at which point we _hope_ that a single pass will have found all instances,
      assume there are no more, and will lose user data if we're wrong.
      
      Swapping of KSM pages has not yet been enabled; but it is implemented,
      and makes it very easy for a user to overflow the maximum swap count:
      possible with ordinary process pages, but unlikely, even when pid_max
      has been raised from PID_MAX_DEFAULT.
      
      This patch implements swap count continuations: when the count overflows,
      a continuation page is allocated and linked to the original vmalloc'ed
      map page, and this used to hold the continuation counts for that entry
      and its neighbours.  These continuation pages are seldom referenced:
      the common paths all work on the original swap_map, only referring to
      a continuation page when the low "digit" of a count is incremented or
      decremented through SWAP_MAP_MAX.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      570a335b
    • Hugh Dickins's avatar
      swap_info: swap_map of chars not shorts · 8d69aaee
      Hugh Dickins authored
      Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
      chars: it's still very unusual to reach a swap count of 126, and the
      next patch allows it to be extended indefinitely.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d69aaee
    • Hugh Dickins's avatar
      swap_info: SWAP_HAS_CACHE cleanups · 253d553b
      Hugh Dickins authored
      Though swap_count() is useful, I'm finding that swap_has_cache() and
      encode_swapmap() obscure what happens in the swap_map entry, just at
      those points where I need to understand it.  Remove them, and pass
      more usable "usage" values to scan_swap_map(), swap_entry_free() and
      __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      253d553b
    • Hugh Dickins's avatar
      swap_info: miscellaneous minor cleanups · 73c34b6a
      Hugh Dickins authored
      Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION
      block, remove extraneous whitespace and return, fix typo in a comment.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73c34b6a
    • Hugh Dickins's avatar
      swap_info: include first_swap_extent · 9625a5f2
      Hugh Dickins authored
      Make better use of the space by folding first swap_extent into its
      swap_info_struct, instead of just the list_head: swap partitions need
      only that one, and for others it's used as a circular list anyway.
      
      [jirislaby@gmail.com: fix crash on double swapon]
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9625a5f2
    • Hugh Dickins's avatar
      swap_info: change to array of pointers · efa90a98
      Hugh Dickins authored
      The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
      to reserve an array of about 30 of them in bss, when most people will
      want only one.  Change swap_info[] to an array of pointers.
      
      That does need a "type" field in the structure: pack it as a char with
      next type and short prio (aha, char is unsigned by default on PowerPC).
      Use the (admittedly peculiar) name "type" throughout for this index.
      
      /proc/swaps does not take swap_lock: I wouldn't want it to, but do take
      care with barriers when adding a new item to the array (never removed).
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      efa90a98
    • Hugh Dickins's avatar
      swap_info: private to swapfile.c · f29ad6a9
      Hugh Dickins authored
      The swap_info_struct is mostly private to mm/swapfile.c, with only
      one other in-tree user: get_swap_bio().  Adjust its interface to
      map_swap_page(), so that we can then remove get_swap_info_struct().
      
      But there is a popular user out-of-tree, TuxOnIce: so leave the
      declaration of swap_info_struct in linux/swap.h.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Nigel Cunningham <ncunningham@crca.org.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f29ad6a9
    • Jan Beulich's avatar
      vmalloc(): adjust gfp mask passed on nested vmalloc() invocation · 976d6dfb
      Jan Beulich authored
      - avoid wasting more precious resources (DMA or DMA32 pools), when
        being called through vmalloc_32{,_user}()
      - explicitly allow using high memory here even if the outer allocation
        request doesn't allow it
      Signed-off-by: default avatarJan Beulich <jbeulich@novell.com>
      Acked-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      976d6dfb
    • David Rientjes's avatar
      mm: add gfp flags for NODEMASK_ALLOC slab allocations · bad44b5b
      David Rientjes authored
      Objects passed to NODEMASK_ALLOC() are relatively small in size and are
      backed by slab caches that are not of large order, traditionally never
      greater than PAGE_ALLOC_COSTLY_ORDER.
      
      Thus, using GFP_KERNEL for these allocations on large machines when
      CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
      the allocation attempt, each time invoking both direct reclaim or the oom
      killer.
      
      This is of particular interest when using NODEMASK_ALLOC() from a
      mempolicy context (either directly in mm/mempolicy.c or the mempolicy
      constrained hugetlb allocations) since the oom killer always kills current
      when allocations are constrained by mempolicies.  So for all present use
      cases in the kernel, current would end up being oom killed when direct
      reclaim fails.  That would allow the NODEMASK_ALLOC() to succeed but
      current would have sacrificed itself upon returning.
      
      This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
      CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
      All current use cases either directly from hugetlb code or indirectly via
      NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
      killer when the slab allocator needs to allocate additional pages.
      
      The side-effect of this change is that all current use cases of either
      NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
      when the allocation fails (never for CONFIG_NODES_SHIFT <= 8).  All
      current use cases were audited and do have appropriate error handling at
      this time.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bad44b5b
    • Lee Schermerhorn's avatar
      hugetlb: offload per node attribute registrations · 39da08cb
      Lee Schermerhorn authored
      Offload the registration and unregistration of per node hstate sysfs
      attributes to a worker thread rather than attempt the
      allocation/attachment or detachment/freeing of the attributes in the
      context of the memory hotplug handler.
      
      I don't know that this is absolutely required, but the registration can
      sleep in allocations and other mem hot plug handlers do it this way.  If
      it turns out this is NOT required, we can drop this patch.
      
      N.B.,  Only tested build, boot, libhugetlbfs regression.
             i.e., no memory hotplug testing.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39da08cb
    • Lee Schermerhorn's avatar
      hugetlb: handle memory hot-plug events · 4faf8d95
      Lee Schermerhorn authored
      Register per node hstate attributes only for nodes with memory.  As
      suggested by David Rientjes.
      
      With Memory Hotplug, memory can be added to a memoryless node and a node
      with memory can become memoryless.  Therefore, add a memory on/off-line
      notifier callback to [un]register a node's attributes on transition
      to/from memoryless state.
      
      N.B.,  Only tested build, boot, libhugetlbfs regression.
             i.e., no memory hotplug testing.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4faf8d95
    • David Rientjes's avatar
      mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlined · 8fe23e05
      David Rientjes authored
      When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
      there are no present pages left.
      
      In such a situation, kswapd must also be stopped since it has nothing left
      to do.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fe23e05
    • Lee Schermerhorn's avatar
      hugetlb: use only nodes with memory for huge pages · 9b5e5d0f
      Lee Schermerhorn authored
      Register per node hstate sysfs attributes only for nodes with memory.
      Global replacement of 'all online nodes" with "all nodes with memory" in
      mm/hugetlb.c.  Suggested by David Rientjes.
      
      A subsequent patch will handle adding/removing of per node hstate sysfs
      attributes when nodes transition to/from memoryless state via memory
      hotplug.
      
      NOTE: this patch has not been tested with memoryless nodes.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b5e5d0f
    • Lee Schermerhorn's avatar
      hugetlb: update hugetlb documentation for NUMA controls · 267b4c28
      Lee Schermerhorn authored
      Update the kernel huge tlb documentation to describe the numa memory
      policy based huge page management.  Additionaly, the patch includes a fair
      amount of rework to improve consistency, eliminate duplication and set the
      context for documenting the memory policy interaction.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      267b4c28
    • Lee Schermerhorn's avatar
      hugetlb: add per node hstate attributes · 9a305230
      Lee Schermerhorn authored
      Add the per huge page size control/query attributes to the per node
      sysdevs:
      
      /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
      	nr_hugepages       - r/w
      	free_huge_pages    - r/o
      	surplus_huge_pages - r/o
      
      The patch attempts to re-use/share as much of the existing global hstate
      attribute initialization and handling, and the "nodes_allowed" constraint
      processing as possible.
      
      Calling set_max_huge_pages() with no node indicates a change to global
      hstate parameters.  In this case, any non-default task mempolicy will be
      used to generate the nodes_allowed mask.  A valid node id indicates an
      update to that node's hstate parameters, and the count argument specifies
      the target count for the specified node.  From this info, we compute the
      target global count for the hstate and construct a nodes_allowed node mask
      contain only the specified node.
      
      Setting the node specific nr_hugepages via the per node attribute
      effectively ignores any task mempolicy or cpuset constraints.
      
      With this patch:
      
      (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
      ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
      
      Starting from:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     0
      Node 2 HugePages_Free:      0
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 0
      
      Allocate 16 persistent huge pages on node 2:
      (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
      
      [Note that this is equivalent to:
      	numactl -m 2 hugeadmin --pool-pages-min 2M:+16
      ]
      
      Yields:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:    16
      Node 2 HugePages_Free:     16
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 16
      
      Global controls work as expected--reduce pool to 8 persistent huge pages:
      (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     8
      Node 2 HugePages_Free:      8
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a305230
    • Lee Schermerhorn's avatar
      hugetlb: add generic definition of NUMA_NO_NODE · 4e25b257
      Lee Schermerhorn authored
      Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
      to generic header 'linux/numa.h' for use in generic code.  NUMA_NO_NODE
      replaces bare '-1' where it's used in this series to indicate "no node id
      specified".  Ultimately, it can be used to replace the -1 elsewhere where
      it is used similarly.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e25b257
    • Lee Schermerhorn's avatar
      hugetlb: derive huge pages nodes allowed from task mempolicy · 06808b08
      Lee Schermerhorn authored
      This patch derives a "nodes_allowed" node mask from the numa mempolicy of
      the task modifying the number of persistent huge pages to control the
      allocation, freeing and adjusting of surplus huge pages when the pool page
      count is modified via the new sysctl or sysfs attribute
      "nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:
      
      * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
        is produced.  This will cause the hugetlb subsystem to use
        node_online_map as the "nodes_allowed".  This preserves the
        behavior before this patch.
      * For "preferred" mempolicy, including explicit local allocation,
        a nodemask with the single preferred node will be produced.
        "local" policy will NOT track any internode migrations of the
        task adjusting nr_hugepages.
      * For "bind" and "interleave" policy, the mempolicy's nodemask
        will be used.
      * Other than to inform the construction of the nodes_allowed node
        mask, the actual mempolicy mode is ignored.  That is, all modes
        behave like interleave over the resulting nodes_allowed mask
        with no "fallback".
      
      See the updated documentation [next patch] for more information
      about the implications of this patch.
      
      Examples:
      
      Starting with:
      
      	Node 0 HugePages_Total:     0
      	Node 1 HugePages_Total:     0
      	Node 2 HugePages_Total:     0
      	Node 3 HugePages_Total:     0
      
      Default behavior [with or without this patch] balances persistent
      hugepage allocation across nodes [with sufficient contiguous memory]:
      
      	sysctl vm.nr_hugepages[_mempolicy]=32
      
      yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:     8
      	Node 3 HugePages_Total:     8
      
      Of course, we only have nr_hugepages_mempolicy with the patch,
      but with default mempolicy, nr_hugepages_mempolicy behaves the
      same as nr_hugepages.
      
      Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
      '--membind' because it allows multiple nodes to be specified
      and it's easy to type]--we can allocate huge pages on
      individual nodes or sets of nodes.  So, starting from the
      condition above, with 8 huge pages per node, add 8 more to
      node 2 using:
      
      	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
      
      This yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The incremental 8 huge pages were restricted to node 2 by the
      specified mempolicy.
      
      Similarly, we can use mempolicy to free persistent huge pages
      from specified nodes:
      
      	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
      
      yields:
      
      	Node 0 HugePages_Total:     4
      	Node 1 HugePages_Total:     4
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The 8 huge pages freed were balanced over nodes 0 and 1.
      
      [rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      06808b08