1. 24 Feb, 2013 22 commits
    • Michel Lespinasse's avatar
      mm: remap_file_pages() fixes · 940e7da5
      Michel Lespinasse authored
      We have many vma manipulation functions that are fast in the typical
      case, but can optionally be instructed to populate an unbounded number
      of ptes within the region they work on:
      
       - mmap with MAP_POPULATE or MAP_LOCKED flags;
       - remap_file_pages() with MAP_NONBLOCK not set or when working on a
         VM_LOCKED vma;
       - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
         effect;
       - brk() when mlock(MCL_FUTURE) is in effect.
      
      Current code handles these pte operations locally, while the
      sourrounding code has to hold the mmap_sem write side since it's
      manipulating vmas.  This means we're doing an unbounded amount of pte
      population work with mmap_sem held, and this causes problems as Andy
      Lutomirski reported (we've hit this at Google as well, though it's not
      entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
      first place).
      
      I propose introducing a new mm_populate() function to do this pte
      population work after the mmap_sem has been released.  mm_populate()
      does need to acquire the mmap_sem read side, but critically, it doesn't
      need to hold it continuously for the entire duration of the operation -
      it can drop it whenever things take too long (such as when hitting disk
      for a file read) and re-acquire it later on.
      
      The following patches are included
      
      - Patches 1 fixes some issues I noticed while working on the existing code.
        If needed, they could potentially go in before the rest of the patches.
      
      - Patch 2 introduces the new mm_populate() function and changes
        mmap_region() call sites to use it after they drop mmap_sem. This is
        inspired from Andy Lutomirski's proposal and is built as an extension
        of the work I had previously done for mlock() and mlockall() around
        v2.6.38-rc1. I had tried doing something similar at the time but had
        given up as there were so many do_mmap() call sites; the recent cleanups
        by Linus and Viro are a tremendous help here.
      
      - Patches 3-5 convert some of the less-obvious places doing unbounded
        pte populates to the new mm_populate() mechanism.
      
      - Patches 6-7 are code cleanups that are made possible by the
        mm_populate() work. In particular, they remove more code than the
        entire patch series added, which should be a good thing :)
      
      - Patch 8 is optional to this entire series. It only helps to deal more
        nicely with racy userspace programs that might modify their mappings
        while we're trying to populate them. It adds a new VM_POPULATE flag
        on the mappings we do want to populate, so that if userspace replaces
        them with mappings it doesn't want populated, mm_populate() won't
        populate those replacement mappings.
      
      This patch:
      
      Assorted small fixes. The first two are quite small:
      
      - Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
        within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
        Purely cosmetic.
      
      - In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
        range, make sure we own the mmap_sem write lock around the
        munlock_vma_pages_range call as this manipulates the vma's vm_flags.
      
      Last fix requires a longer explanation. remap_file_pages() can do its work
      either through VM_NONLINEAR manipulation or by creating extra vmas.
      These two cases were inconsistent with each other (and ultimately, both wrong)
      as to exactly when did they fault in the newly mapped file pages:
      
      - In the VM_NONLINEAR case, new file pages would be populated if
        the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
        new file pages wouldn't be populated even if the vma is already
        marked as VM_LOCKED.
      
      - In the linear (emulated) case, the work is passed to the mmap_region()
        function which would populate the pages if the vma is marked as
        VM_LOCKED, and would not otherwise - regardless of the value of the
        MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
        mmap_region().
      
      The desired behavior is that we want the pages to be populated and locked
      if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
      flag is not passed to remap_file_pages().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      940e7da5
    • Zlatko Calusic's avatar
      mm: avoid calling pgdat_balanced() needlessly · dafcb73e
      Zlatko Calusic authored
      Now that balance_pgdat() is slightly tidied up, thanks to more capable
      pgdat_balanced(), it's become obvious that pgdat_balanced() is called to
      check the status, then break the loop if pgdat is balanced, just to be
      immediately called again.  The second call is completely unnecessary, of
      course.
      
      The patch introduces pgdat_is_balanced boolean, which helps resolve the
      above suboptimal behavior, with the added benefit of slightly better
      documenting one other place in the function where we jump and skip lots
      of code.
      Signed-off-by: default avatarZlatko Calusic <zlatko.calusic@iskon.hr>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dafcb73e
    • Andrew Morton's avatar
      mm: compaction: make __compact_pgdat() and compact_pgdat() return void · 7103f16d
      Andrew Morton authored
      These functions always return 0.  Formalise this.
      
      Cc: Jason Liu <r64343@freescale.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7103f16d
    • Shaohua Li's avatar
      mm: make madvise(MADV_WILLNEED) support swap file prefetch · 1998cc04
      Shaohua Li authored
      Make madvise(MADV_WILLNEED) support swap file prefetch.  If memory is
      swapout, this syscall can do swapin prefetch.  It has no impact if the
      memory isn't swapout.
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      [sasha.levin@oracle.com: fix BUG on madvise early failure]
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1998cc04
    • Michal Hocko's avatar
      memcg,vmscan: do not break out targeted reclaim without reclaimed pages · a394cb8e
      Michal Hocko authored
      Targeted (hard resp soft) reclaim has traditionally tried to scan one
      group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
      pages) is reclaimed or all priorities are exhausted.  The reclaim is
      then retried until the limit is met.
      
      This approach, however, doesn't work well with deeper hierarchies where
      groups higher in the hierarchy do not have any or only very few pages
      (this usually happens if those groups do not have any tasks and they
      have only re-parented pages after some of their children is removed).
      Those groups are reclaimed with decreasing priority pointlessly as there
      is nothing to reclaim from them.
      
      An easiest fix is to break out of the memcg iteration loop in
      shrink_zone only if the whole hierarchy has been visited or sufficient
      pages have been reclaimed.  This is also more natural because the
      reclaimer expects that the hierarchy under the given root is reclaimed.
      As a result we can simplify the soft limit reclaim which does its own
      iteration.
      
      [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
      [akpm@linux-foundation.org: use conventional comparison order]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reported-by: default avatarYing Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a394cb8e
    • Sasha Levin's avatar
      mm/ksm.c: use new hashtable implementation · 4ca3a69b
      Sasha Levin authored
      Switch ksm to use the new hashtable implementation.  This reduces the
      amount of generic unrelated code in the ksm module.
      Signed-off-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ca3a69b
    • Sasha Levin's avatar
      mm/huge_memory.c: use new hashtable implementation · 43b5fbbd
      Sasha Levin authored
      Switch hugemem to use the new hashtable implementation.  This reduces
      the amount of generic unrelated code in the hugemem.
      
      This also removes the dymanic allocation of the hash table.  The upside
      is that we save a pointer dereference when accessing the hashtable, but
      we lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
      doesn't support hugepages.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43b5fbbd
    • Mel Gorman's avatar
      mm: compaction: do not accidentally skip pageblocks in the migrate scanner · a9aacbcc
      Mel Gorman authored
      Compaction uses the ALIGN macro incorrectly with the migrate scanner by
      adding pageblock_nr_pages to a PFN.  It happened to work when initially
      implemented as the starting PFN was also aligned but with caching
      restarts and isolating in smaller chunks this is no longer always true.
      
      The impact is that the migrate scanner scans outside its current
      pageblock.  As pfn_valid() is still checked properly it does not cause
      any failure and the impact of the bug is that in some cases it will scan
      more than necessary when it crosses a page boundary but by no more than
      COMPACT_CLUSTER_MAX.  It is highly unlikely this is even measurable but
      it's still wrong so this patch addresses the problem.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9aacbcc
    • Andrew Morton's avatar
      mm/vmscan.c:__zone_reclaim(): replace max_t() with max() · 62b726c1
      Andrew Morton authored
      "mm: vmscan: save work scanning (almost) empty LRU lists" made
      SWAP_CLUSTER_MAX an unsigned long.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62b726c1
    • Andrew Morton's avatar
      mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long · 90ae8d67
      Andrew Morton authored
      `int' is an inappropriate type for a number-of-pages counter.
      
      While we're there, use the clamp() macro.
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90ae8d67
    • Johannes Weiner's avatar
      mm: reduce rmap overhead for ex-KSM page copies created on swap faults · af34770e
      Johannes Weiner authored
      When ex-KSM pages are faulted from swap cache, the fault handler is not
      capable of re-establishing anon_vma-spanning KSM pages.  In this case, a
      copy of the page is created instead, just like during a COW break.
      
      These freshly made copies are known to be exclusive to the faulting VMA
      and there is no reason to go look for this page in parent and sibling
      processes during rmap operations.
      
      Use page_add_new_anon_rmap() for these copies.  This also puts them on
      the proper LRU lists and marks them SwapBacked, so we can get rid of
      doing this ad-hoc in the KSM copy code.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af34770e
    • Johannes Weiner's avatar
      mm: vmscan: compaction works against zones, not lruvecs · 9b4f98cd
      Johannes Weiner authored
      The restart logic for when reclaim operates back to back with compaction
      is currently applied on the lruvec level.  But this does not make sense,
      because the container of interest for compaction is a zone as a whole,
      not the zone pages that are part of a certain memory cgroup.
      
      Negative impact is bounded.  For one, the code checks that the lruvec
      has enough reclaim candidates, so it does not risk getting stuck on a
      condition that can not be fulfilled.  And the unfairness of hammering on
      one particular memory cgroup to make progress in a zone will be
      amortized by the round robin manner in which reclaim goes through the
      memory cgroups.  Still, this can lead to unnecessary allocation
      latencies when the code elects to restart on a hard to reclaim or small
      group when there are other, more reclaimable groups in the zone.
      
      Move this logic to the zone level and restart reclaim for all memory
      cgroups in a zone when compaction requires more free pages from it.
      
      [akpm@linux-foundation.org: no need for min_t]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b4f98cd
    • Johannes Weiner's avatar
      mm: vmscan: clean up get_scan_count() · 9a265114
      Johannes Weiner authored
      Reclaim pressure balance between anon and file pages is calculated
      through a tuple of numerators and a shared denominator.
      
      Exceptional cases that want to force-scan anon or file pages configure
      the numerators and denominator such that one list is preferred, which is
      not necessarily the most obvious way:
      
          fraction[0] = 1;
          fraction[1] = 0;
          denominator = 1;
          goto out;
      
      Make this easier by making the force-scan cases explicit and use the
      fractionals only in case they are calculated from reclaim history.
      
      [akpm@linux-foundation.org: avoid using unintialized_var()]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a265114
    • Johannes Weiner's avatar
      mm: vmscan: improve comment on low-page cache handling · 11d16c25
      Johannes Weiner authored
      Fix comment style and elaborate on why anonymous memory is force-scanned
      when file cache runs low.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      11d16c25
    • Johannes Weiner's avatar
      mm: vmscan: clarify how swappiness, highest priority, memcg interact · 10316b31
      Johannes Weiner authored
      A swappiness of 0 has a slightly different meaning for global reclaim
      (may swap if file cache really low) and memory cgroup reclaim (never
      swap, ever).
      
      In addition, global reclaim at highest priority will scan all LRU lists
      equal to their size and ignore other balancing heuristics.  UNLESS
      swappiness forbids swapping, then the lists are balanced based on recent
      reclaim effectiveness.  UNLESS file cache is running low, then anonymous
      pages are force-scanned.
      
      This (total mess of a) behaviour is implicit and not obvious from the
      way the code is organized.  At least make it apparent in the code flow
      and document the conditions.  It will be it easier to come up with sane
      semantics later.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarSatoru Moriya <satoru.moriya@hds.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10316b31
    • Johannes Weiner's avatar
      mm: vmscan: save work scanning (almost) empty LRU lists · d778df51
      Johannes Weiner authored
      In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
      amount of pages is scanned from the LRU lists on each iteration, to make
      progress.
      
      Do not make this minimum bigger than the respective LRU list size,
      however, and save some busy work trying to isolate and reclaim pages
      that are not there.
      
      Empty LRU lists are quite common with memory cgroups in NUMA
      environments because there exists a set of LRU lists for each zone for
      each memory cgroup, while the memory of a single cgroup is expected to
      stay on just one node.  The number of expected empty LRU lists is thus
      
        memcgs * (nodes - 1) * lru types
      
      Each attempt to reclaim from an empty LRU list does expensive size
      comparisons between lists, acquires the zone's lru lock etc.  Avoid
      that.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d778df51
    • Johannes Weiner's avatar
      mm: memcg: only evict file pages when we have plenty · 7c5bd705
      Johannes Weiner authored
      Commit e9868505 ("mm, vmscan: only evict file pages when we have
      plenty") makes a point of not going for anonymous memory while there is
      still enough inactive cache around.
      
      The check was added only for global reclaim, but it is just as useful to
      reduce swapping in memory cgroup reclaim:
      
          200M-memcg-defconfig-j2
      
                                           vanilla                   patched
          Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
          User time              668.57 (  +0.00%)         668.73 (  +0.02%)
          System time            128.92 (  +0.00%)         129.53 (  +0.46%)
          Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
          Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
          Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
          Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
          THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
          THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
          THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c5bd705
    • Srinivas Pandruvada's avatar
      CMA: make putback_lru_pages() call conditional · 2a6f5124
      Srinivas Pandruvada authored
      As per documentation and other places calling putback_lru_pages(),
      putback_lru_pages() is called on error only.  Make the CMA code behave
      consistently.
      
      [akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
      Signed-off-by: default avatarSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a6f5124
    • Andrew Morton's avatar
      mm/hugetlb.c: convert to pr_foo() · ffb22af5
      Andrew Morton authored
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ffb22af5
    • Andrew Morton's avatar
      mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo() · d045197f
      Andrew Morton authored
      Acked-by: default avatarSha Zhengju <handai.szj@taobao.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d045197f
    • Sha Zhengju's avatar
      memcg, oom: provide more precise dump info while memcg oom happening · 58cf188e
      Sha Zhengju authored
      Currently when a memcg oom is happening the oom dump messages is still
      global state and provides few useful info for users.  This patch prints
      more pointed memcg page statistics for memcg-oom and take hierarchy into
      consideration:
      
      Based on Michal's advice, we take hierarchy into consideration: supppose
      we trigger an OOM on A's limit
      
              root_memcg
                  |
                  A (use_hierachy=1)
                 / \
                B   C
                |
                D
      then the printed info will be:
      
        Memory cgroup stats for /A:...
        Memory cgroup stats for /A/B:...
        Memory cgroup stats for /A/C:...
        Memory cgroup stats for /A/B/D:...
      
      Following are samples of oom output:
      
      (1) Before change:
      
          mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fbfb>] dump_header+0x83/0x1ca
           ..... (call trace)
           [<ffffffff8168a818>] page_fault+0x28/0x30
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          Task in /A/B/D killed as a result of limit of /A
          memory: usage 101376kB, limit 101376kB, failcnt 57
          memory+swap: usage 101376kB, limit 101376kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
                                   <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
          Mem-Info:
          Node 0 DMA per-cpu:
          CPU    0: hi:    0, btch:   1 usd:   0
          ......
          CPU    3: hi:    0, btch:   1 usd:   0
          Node 0 DMA32 per-cpu:
          CPU    0: hi:  186, btch:  31 usd: 173
          ......
          CPU    3: hi:  186, btch:  31 usd: 130
                                   <<<<<<<<<<<<<<<<<<<<< print global page state
          active_anon:92963 inactive_anon:40777 isolated_anon:0
           active_file:33027 inactive_file:51718 isolated_file:0
           unevictable:0 dirty:3 writeback:0 unstable:0
           free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
           mapped:20278 shmem:35971 pagetables:5885 bounce:0
           free_cma:0
                                   <<<<<<<<<<<<<<<<<<<<< print per zone page state
          Node 0 DMA free:15836kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 3175 3899 3899
          Node 0 DMA32 free:2888564kB ... all_unrelaimable? no
          lowmem_reserve[]: 0 0 724 724
          lowmem_reserve[]: 0 0 0 0
          Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
          Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
          120710 total pagecache pages
          0 pages in swap cache
                                   <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
          Swap cache stats: add 0, delete 0, find 0/0
          Free swap  = 499708kB
          Total swap = 499708kB
          1040368 pages RAM
          58678 pages reserved
          169065 pages shared
          173632 pages non-shared
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2693]     0  2693     6005     1324      17        0             0 god
          [ 2754]     0  2754     6003     1320      16        0             0 god
          [ 2811]     0  2811     5992     1304      18        0             0 god
          [ 2874]     0  2874     6005     1323      18        0             0 god
          [ 2935]     0  2935     8720     7742      21        0             0 mal-30
          [ 2976]     0  2976    21520    17577      42        0             0 mal-80
          Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
          Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
      
      We can see that messages dumped by show_free_areas() are longsome and can
      provide so limited info for memcg that just happen oom.
      
      (2) After change
          mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fd0b>] dump_header+0x83/0x1d1
           .......(call trace)
           [<ffffffff8168a918>] page_fault+0x28/0x30
          Task in /A/B/D killed as a result of limit of /A
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          memory: usage 102400kB, limit 102400kB, failcnt 140
          memory+swap: usage 102400kB, limit 102400kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
          Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2260]     0  2260     6006     1325      18        0             0 god
          [ 2383]     0  2383     6003     1319      17        0             0 god
          [ 2503]     0  2503     6004     1321      18        0             0 god
          [ 2622]     0  2622     6004     1321      16        0             0 god
          [ 2695]     0  2695     8720     7741      22        0             0 mal-30
          [ 2704]     0  2704    21520    17839      43        0             0 mal-80
          Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
          Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
      
      This version provides more pointed info for memcg in "Memory cgroup stats
      for XXX" section.
      Signed-off-by: default avatarSha Zhengju <handai.szj@taobao.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58cf188e
    • Andrew Morton's avatar
      drivers/md/persistent-data/dm-transaction-manager.c: rename HASH_SIZE · df855798
      Andrew Morton authored
      Fix the warning:
      
        drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
        In file included from include/linux/elevator.h:5,
                         from include/linux/blkdev.h:216,
                         from drivers/md/persistent-data/dm-block-manager.h:11,
                         from drivers/md/persistent-data/dm-transaction-manager.h:10,
                         from drivers/md/persistent-data/dm-transaction-manager.c:6:
        include/linux/hashtable.h:22:1: warning: this is the location of the previous definition
      
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df855798
  2. 22 Feb, 2013 18 commits