1. 01 Nov, 2011 40 commits
    • Alex,Shi's avatar
      kswapd: avoid unnecessary rebalance after an unsuccessful balancing · d2ebd0f6
      Alex,Shi authored
      In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
      when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
      after a unsuccessful balancing if there is tighter reclaim request pending
      in the balancing.  But in the following scenario, kswapd do something that
      is not matched our expectation.  The patch fixes this issue.
      
      1, Read pgdat request A (classzone_idx, order = 3)
      2, balance_pgdat()
      3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
      4, balance_pgdat() returns but failed since returned order = 0
      5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
         While the expectation behavior of kswapd should try to sleep.
      Signed-off-by: default avatarAlex Shi <alex.shi@intel.com>
      Reviewed-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Tested-by: default avatarPádraig Brady <P@draigBrady.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2ebd0f6
    • Akinobu Mita's avatar
      debug-pagealloc: add support for highmem pages · 64212ec5
      Akinobu Mita authored
      This adds support for highmem pages poisoning and verification to the
      debug-pagealloc feature for no-architecture support.
      
      [akpm@linux-foundation.org: remove unneeded preempt_disable/enable]
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64212ec5
    • Joe Perches's avatar
      mm: neaten warn_alloc_failed · 3ee9a4f0
      Joe Perches authored
      Add __attribute__((format (printf...) to the function to validate format
      and arguments.  Use vsprintf extension %pV to avoid any possible message
      interleaving.  Coalesce format string.  Convert printks/pr_warning to
      pr_warn.
      
      [akpm@linux-foundation.org: use the __printf() macro]
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ee9a4f0
    • Sonic Zhang's avatar
      include/asm-generic/page.h: calculate virt_to_page and page_to_virt via predefined macro · 06d5e032
      Sonic Zhang authored
      On NOMMU architectures, if physical memory doesn't start from 0,
      ARCH_PFN_OFFSET is defined to generate page index in mem_map array.
      Because virtual address is equal to physical address, PAGE_OFFSET is
      always 0.  virt_to_page and page_to_virt should not index page by
      PAGE_OFFSET directly.
      Signed-off-by: default avatarSonic Zhang <sonic.zhang@analog.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      06d5e032
    • Andrea Arcangeli's avatar
      thp: mremap support and TLB optimization · 37a1c49a
      Andrea Arcangeli authored
      This adds THP support to mremap (decreases the number of split_huge_page()
      calls).
      
      Here are also some benchmarks with a proggy like this:
      
      ===
      #define _GNU_SOURCE
      #include <sys/mman.h>
      #include <stdlib.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (5UL*1024*1024*1024)
      
      int main()
      {
              static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	gettimeofday(&oldstamp, NULL);
      	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	gettimeofday(&newstamp, NULL);
      	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
      	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
      	printf("usec %ld\n", diffsec);
      	if (p == MAP_FAILED || p4 != p3)
      	//if (p == MAP_FAILED)
      		perror("mremap"), exit(1);
      	if (memcmp(p4, p2, SIZE))
      		printf("mremap bug\n"), exit(1);
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      THP on
      
       Performance counter stats for './largepage13' (3 runs):
      
                69195836 dTLB-loads                 ( +-   3.546% )  (scaled from 50.30%)
                   60708 dTLB-load-misses           ( +-  11.776% )  (scaled from 52.62%)
               676266476 dTLB-stores                ( +-   5.654% )  (scaled from 69.54%)
                   29856 dTLB-store-misses          ( +-   4.081% )  (scaled from 89.22%)
              1055848782 iTLB-loads                 ( +-   4.526% )  (scaled from 80.18%)
                    8689 iTLB-load-misses           ( +-   2.987% )  (scaled from 58.20%)
      
              7.314454164  seconds time elapsed   ( +-   0.023% )
      
      THP off
      
       Performance counter stats for './largepage13' (3 runs):
      
              1967379311 dTLB-loads                 ( +-   0.506% )  (scaled from 60.59%)
                 9238687 dTLB-load-misses           ( +-  22.547% )  (scaled from 61.87%)
              2014239444 dTLB-stores                ( +-   0.692% )  (scaled from 60.40%)
                 3312335 dTLB-store-misses          ( +-   7.304% )  (scaled from 67.60%)
              6764372065 iTLB-loads                 ( +-   0.925% )  (scaled from 79.00%)
                    8202 iTLB-load-misses           ( +-   0.475% )  (scaled from 70.55%)
      
              9.693655243  seconds time elapsed   ( +-   0.069% )
      
      grep thp /proc/vmstat
      thp_fault_alloc 35849
      thp_fault_fallback 0
      thp_collapse_alloc 3
      thp_collapse_alloc_failed 0
      thp_split 0
      
      thp_split 0 confirms no thp split despite plenty of hugepages allocated.
      
      The measurement of only the mremap time (so excluding the 3 long
      memset and final long 10GB memory accessing memcmp):
      
      THP on
      
      usec 14824
      usec 14862
      usec 14859
      
      THP off
      
      usec 256416
      usec 255981
      usec 255847
      
      With an older kernel without the mremap optimizations (the below patch
      optimizes the non THP version too).
      
      THP on
      
      usec 392107
      usec 390237
      usec 404124
      
      THP off
      
      usec 444294
      usec 445237
      usec 445820
      
      I guess with a threaded program that sends more IPI on large SMP it'd
      create an even larger difference.
      
      All debug options are off except DEBUG_VM to avoid skewing the
      results.
      
      The only problem for native 2M mremap like it happens above both the
      source and destination address must be 2M aligned or the hugepmd can't be
      moved without a split but that is an hardware limitation.
      
      [akpm@linux-foundation.org: coding-style nitpicking]
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      37a1c49a
    • Andrea Arcangeli's avatar
      mremap: avoid sending one IPI per page · 7b6efc2b
      Andrea Arcangeli authored
      This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
      flush_tlb_range() at the end of the loop, to avoid sending one IPI for
      each page.
      
      The mmu_notifier_invalidate_range_start/end section is enlarged
      accordingly but this is not going to fundamentally change things.  It was
      more by accident that the region under mremap was for the most part still
      available for secondary MMUs: the primary MMU was never allowed to
      reliably access that region for the duration of the mremap (modulo
      trapping SIGSEGV on the old address range which sounds unpractical and
      flakey).  If users wants secondary MMUs not to lose access to a large
      region under mremap they should reduce the mremap size accordingly in
      userland and run multiple calls.  Overall this will run faster so it's
      actually going to reduce the time the region is under mremap for the
      primary MMU which should provide a net benefit to apps.
      
      For KVM this is a noop because the guest physical memory is never
      mremapped, there's just no point it ever moving it while guest runs.  One
      target of this optimization is JVM GC (so unrelated to the mmu notifier
      logic).
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b6efc2b
    • Andrea Arcangeli's avatar
      mremap: check for overflow using deltas · ebed4846
      Andrea Arcangeli authored
      Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
      those are reasonable requirements but the check remains obscure and it
      looks more like an off by one error than an overflow check.  This I feel
      will improve readability.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebed4846
    • Sam Ravnborg's avatar
      memblock: add NO_BOOTMEM config symbol · 66616720
      Sam Ravnborg authored
      With the NO_BOOTMEM symbol added architectures may now use the following
      syntax to tell that they do not need bootmem:
      
      	select NO_BOOTMEM
      
      This is much more convinient than adding a new kconfig symbol which was
      otherwise required.
      
      Adding this symbol does not conflict with the architctures that already
      define their own symbol.
      Signed-off-by: default avatarSam Ravnborg <sam@ravnborg.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66616720
    • Sam Ravnborg's avatar
      memblock: add memblock_start_of_DRAM() · 0a93ebef
      Sam Ravnborg authored
      SPARC32 require access to the start address.  Add a new helper
      memblock_start_of_DRAM() to give access to the address of the first
      memblock - which contains the lowest address.
      
      The awkward name was chosen to match the already present
      memblock_end_of_DRAM().
      Signed-off-by: default avatarSam Ravnborg <sam@ravnborg.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a93ebef
    • Mitsuo Hayasaka's avatar
      mm: avoid null pointer access in vm_struct via /proc/vmallocinfo · f5252e00
      Mitsuo Hayasaka authored
      The /proc/vmallocinfo shows information about vmalloc allocations in
      vmlist that is a linklist of vm_struct.  It, however, may access pages
      field of vm_struct where a page was not allocated.  This results in a null
      pointer access and leads to a kernel panic.
      
      Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
      allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
      some fields of vm_struct such as nr_pages and pages are set at
      __vmalloc_area_node().  In other words, it is added to vmlist before it is
      fully initialized.  At the same time, when the /proc/vmallocinfo is read,
      it accesses the pages field of vm_struct according to the nr_pages field
      at show_numa_info().  Thus, a null pointer access happens.
      
      The patch adds the newly allocated vm_struct to the vmlist *after* it is
      fully initialized.  So, it can avoid accessing the pages field with
      unallocated page when show_numa_info() is called.
      Signed-off-by: default avatarMitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f5252e00
    • Akinobu Mita's avatar
      mm/debug-pagealloc.c: use memchr_inv · 8c5fb8ea
      Akinobu Mita authored
      Use newly introduced memchr_inv() for page verification.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c5fb8ea
    • Akinobu Mita's avatar
      lib/string.c: introduce memchr_inv() · 79824820
      Akinobu Mita authored
      memchr_inv() is mainly used to check whether the whole buffer is filled
      with just a specified byte.
      
      The function name and prototype are stolen from logfs and the
      implementation is from SLUB.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Acked-by: default avatarJoern Engel <joern@logfs.org>
      Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79824820
    • Akinobu Mita's avatar
      mm/debug-pagealloc.c: use plain __ratelimit() instead of printk_ratelimit() · 77311139
      Akinobu Mita authored
      printk_ratelimit() should not be used, because it shares ratelimiting
      state with all other unrelated printk_ratelimit() callsites.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77311139
    • Shaohua Li's avatar
      vmscan: count pages into balanced for zone with good watermark · 16fb9512
      Shaohua Li authored
      It's possible a zone watermark is ok when entering the balance_pgdat()
      loop, while the zone is within the requested classzone_idx.  Count pages
      from this zone into `balanced'.  In this way, we can skip shrinking zones
      too much for high order allocation.
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16fb9512
    • Mel Gorman's avatar
      mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes · 49ea7eb6
      Mel Gorman authored
      When direct reclaim encounters a dirty page, it gets recycled around the
      LRU for another cycle.  This patch marks the page PageReclaim similar to
      deactivate_page() so that the page gets reclaimed almost immediately after
      the page gets cleaned.  This is to avoid reclaiming clean pages that are
      younger than a dirty page encountered at the end of the LRU that might
      have been something like a use-once page.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49ea7eb6
    • Mel Gorman's avatar
      mm: vmscan: throttle reclaim if encountering too many dirty pages under writeback · 92df3a72
      Mel Gorman authored
      Workloads that are allocating frequently and writing files place a large
      number of dirty pages on the LRU.  With use-once logic, it is possible for
      them to reach the end of the LRU quickly requiring the reclaimer to scan
      more to find clean pages.  Ordinarily, processes that are dirtying memory
      will get throttled by dirty balancing but this is a global heuristic and
      does not take into account that LRUs are maintained on a per-zone basis.
      This can lead to a situation whereby reclaim is scanning heavily, skipping
      over a large number of pages under writeback and recycling them around the
      LRU consuming CPU.
      
      This patch checks how many of the number of pages isolated from the LRU
      were dirty and under writeback.  If a percentage of them under writeback,
      the process will be throttled if a backing device or the zone is
      congested.  Note that this applies whether it is anonymous or file-backed
      pages that are under writeback meaning that swapping is potentially
      throttled.  This is intentional due to the fact if the swap device is
      congested, scanning more pages and dispatching more IO is not going to
      help matters.
      
      The percentage that must be in writeback depends on the priority.  At
      default priority, all of them must be dirty.  At DEF_PRIORITY-1, 50% of
      them must be, DEF_PRIORITY-2, 25% etc.  i.e.  as pressure increases the
      greater the likelihood the process will get throttled to allow the flusher
      threads to make some progress.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      92df3a72
    • Mel Gorman's avatar
      mm: vmscan: do not writeback filesystem pages in kswapd except in high priority · f84f6e2b
      Mel Gorman authored
      It is preferable that no dirty pages are dispatched for cleaning from the
      page reclaim path.  At normal priorities, this patch prevents kswapd
      writing pages.
      
      However, page reclaim does have a requirement that pages be freed in a
      particular zone.  If it is failing to make sufficient progress (reclaiming
      < SWAP_CLUSTER_MAX at any priority priority), the priority is raised to
      scan more pages.  A priority of DEF_PRIORITY - 3 is considered to be the
      point where kswapd is getting into trouble reclaiming pages.  If this
      priority is reached, kswapd will dispatch pages for writing.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f84f6e2b
    • Mel Gorman's avatar
      ext4: warn if direct reclaim tries to writeback pages · 966dbde2
      Mel Gorman authored
      Direct reclaim should never writeback pages.  Warn if an attempt is made.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      966dbde2
    • Mel Gorman's avatar
      xfs: warn if direct reclaim tries to writeback pages · 94054fa3
      Mel Gorman authored
      Direct reclaim should never writeback pages.  For now, handle the
      situation and warn about it.  Ultimately, this will be a BUG_ON.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94054fa3
    • Mel Gorman's avatar
      mm: vmscan: remove dead code related to lumpy reclaim waiting on pages under writeback · a18bba06
      Mel Gorman authored
      Lumpy reclaim worked with two passes - the first which queued pages for IO
      and the second which waited on writeback.  As direct reclaim can no longer
      write pages there is some dead code.  This patch removes it but direct
      reclaim will continue to wait on pages under writeback while in
      synchronous reclaim mode.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a18bba06
    • Mel Gorman's avatar
      mm: vmscan: do not writeback filesystem pages in direct reclaim · ee72886d
      Mel Gorman authored
      Testing from the XFS folk revealed that there is still too much I/O from
      the end of the LRU in kswapd.  Previously it was considered acceptable by
      VM people for a small number of pages to be written back from reclaim with
      testing generally showing about 0.3% of pages reclaimed were written back
      (higher if memory was low).  That writing back a small number of pages is
      ok has been heavily disputed for quite some time and Dave Chinner
      explained it well;
      
      	It doesn't have to be a very high number to be a problem. IO
      	is orders of magnitude slower than the CPU time it takes to
      	flush a page, so the cost of making a bad flush decision is
      	very high. And single page writeback from the LRU is almost
      	always a bad flush decision.
      
      To complicate matters, filesystems respond very differently to requests
      from reclaim according to Christoph Hellwig;
      
      	xfs tries to write it back if the requester is kswapd
      	ext4 ignores the request if it's a delayed allocation
      	btrfs ignores the request
      
      As a result, each filesystem has different performance characteristics
      when under memory pressure and there are many pages being dirtied.  In
      some cases, the request is ignored entirely so the VM cannot depend on the
      IO being dispatched.
      
      The objective of this series is to reduce writing of filesystem-backed
      pages from reclaim, play nicely with writeback that is already in progress
      and throttle reclaim appropriately when writeback pages are encountered.
      The assumption is that the flushers will always write pages faster than if
      reclaim issues the IO.
      
      A secondary goal is to avoid the problem whereby direct reclaim splices
      two potentially deep call stacks together.
      
      There is a potential new problem as reclaim has less control over how long
      before a page in a particularly zone or container is cleaned and direct
      reclaimers depend on kswapd or flusher threads to do the necessary work.
      However, as filesystems sometimes ignore direct reclaim requests already,
      it is not expected to be a serious issue.
      
      Patch 1 disables writeback of filesystem pages from direct reclaim
      	entirely. Anonymous pages are still written.
      
      Patch 2 removes dead code in lumpy reclaim as it is no longer able
      	to synchronously write pages. This hurts lumpy reclaim but
      	there is an expectation that compaction is used for hugepage
      	allocations these days and lumpy reclaim's days are numbered.
      
      Patches 3-4 add warnings to XFS and ext4 if called from
      	direct reclaim. With patch 1, this "never happens" and is
      	intended to catch regressions in this logic in the future.
      
      Patch 5 disables writeback of filesystem pages from kswapd unless
      	the priority is raised to the point where kswapd is considered
      	to be in trouble.
      
      Patch 6 throttles reclaimers if too many dirty pages are being
      	encountered and the zones or backing devices are congested.
      
      Patch 7 invalidates dirty pages found at the end of the LRU so they
      	are reclaimed quickly after being written back rather than
      	waiting for a reclaimer to find them
      
      I consider this series to be orthogonal to the writeback work but it is
      worth noting that the writeback work affects the viability of patch 8 in
      particular.
      
      I tested this on ext4 and xfs using fs_mark, a simple writeback test based
      on dd and a micro benchmark that does a streaming write to a large mapping
      (exercises use-once LRU logic) followed by streaming writes to a mix of
      anonymous and file-backed mappings.  The command line for fs_mark when
      botted with 512M looked something like
      
      ./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
      
      The number of files was adjusted depending on the amount of available
      memory so that the files created was about 3xRAM.  For multiple threads,
      the -d switch is specified multiple times.
      
      The test machine is x86-64 with an older generation of AMD processor with
      4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
      was the best configuration of storage I had available.  Swap is on a
      separate disk.  Dirty ratio was tuned to 40% instead of the default of
      20%.
      
      Testing was run with and without monitors to both verify that the patches
      were operating as expected and that any performance gain was real and not
      due to interference from monitors.
      
      Here is a summary of results based on testing XFS.
      
      512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
      512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
      512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
      512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
      512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
      512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
      512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
      512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
      512M-xfs             Elapsed Time fsmark                    56.08     48.90
      512M-xfs             Elapsed Time simple-wb                112.22     98.13
      512M-xfs             Elapsed Time mmap-strm                219.15    196.67
      512M-xfs             Kswapd efficiency fsmark                 54%       56%
      512M-xfs             Kswapd efficiency simple-wb              54%       55%
      512M-xfs             Kswapd efficiency mmap-strm              45%       44%
      512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
      512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
      512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
      512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
      512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
      512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
      512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
      512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
      512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
      512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
      512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
      512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
      512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
      512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
      
      Up until 512-4X, the FSmark improvements were statistically significant.
      For the 4X and 16X tests the results were within standard deviations but
      just barely.  The time to completion for all tests is improved which is an
      important result.  In general, kswapd efficiency is not affected by
      skipping dirty pages.
      
      1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
      1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
      1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
      1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
      1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
      1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
      1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
      1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
      1024M-xfs            Elapsed Time fsmark                    94.59     91.00
      1024M-xfs            Elapsed Time simple-wb                229.84    195.08
      1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
      1024M-xfs            Kswapd efficiency fsmark                 79%       71%
      1024M-xfs            Kswapd efficiency simple-wb              74%       74%
      1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
      1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
      1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
      1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
      1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
      1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
      1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
      1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
      1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
      1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
      1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
      1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
      1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
      1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
      1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
      
      All FSMark tests up to 16X had statistically significant improvements.
      For the most part, tests are completing faster with the exception of the
      streaming writes to a mixture of anonymous and file-backed mappings which
      were slower in two cases
      
      In the cases where the mmap-strm tests were slower, there was more
      swapping due to dirty pages being skipped.  The number of additional pages
      swapped is almost identical to the fewer number of pages written from
      reclaim.  In other words, roughly the same number of pages were reclaimed
      but swapping was slower.  As the test is a bit unrealistic and stresses
      memory heavily, the small shift is acceptable.
      
      4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
      4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
      4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
      4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
      4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
      4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
      4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
      4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
      4608M-xfs            Elapsed Time fsmark                   555.96    532.34
      4608M-xfs            Elapsed Time simple-wb                659.72    571.85
      4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
      4608M-xfs            Kswapd efficiency fsmark                 89%       91%
      4608M-xfs            Kswapd efficiency simple-wb              88%       82%
      4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
      4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
      4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
      4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
      4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
      4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
      4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
      4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
      4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
      4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
      4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
      4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
      4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
      4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
      4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
      
      Unlike the other tests, the fsmark results are not statistically
      significant but the min and max times are both improved and for the most
      part, tests completed faster.
      
      There are other indications that this is an improvement as well.  For
      example, in the vast majority of cases, there were fewer pages scanned by
      direct reclaim implying in many cases that stalls due to direct reclaim
      are reduced.  KSwapd is scanning more due to skipping dirty pages which is
      unfortunate but the CPU usage is still acceptable
      
      In an earlier set of tests, I used blktrace and in almost all cases
      throughput throughout the entire test was higher.  However, I ended up
      discarding those results as recording blktrace data was too heavy for my
      liking.
      
      On a laptop, I plugged in a USB stick and ran a similar tests of tests
      using it as backing storage.  A desktop environment was running and for
      the entire duration of the tests, firefox and gnome terminal were
      launching and exiting to vaguely simulate a user.
      
      1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
      1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
      1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
      1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
      1024M-xfs            Kswapd efficiency fsmark              84%       85%
      1024M-xfs            Kswapd efficiency simple-wb           92%       81%
      1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
      1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
      1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
      1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53
      
      The mmap-strm results were hurt because firefox launching had a tendency
      to push the test out of memory.  On the postive side, firefox launched
      marginally faster with the patches applied.  Time to completion for many
      tests was faster but more importantly - the "Avg wait" time as measured by
      iostat was far lower implying the system would be more responsive.  It was
      also the case that "Avg wait ms" on the root filesystem was lower.  I
      tested it manually and while the system felt slightly more responsive
      while copying data to a USB stick, it was marginal enough that it could be
      my imagination.
      
      This patch: do not writeback filesystem pages in direct reclaim.
      
      When kswapd is failing to keep zones above the min watermark, a process
      will enter direct reclaim in the same manner kswapd does.  If a dirty page
      is encountered during the scan, this page is written to backing storage
      using mapping->writepage.
      
      This causes two problems.  First, it can result in very deep call stacks,
      particularly if the target storage or filesystem are complex.  Some
      filesystems ignore write requests from direct reclaim as a result.  The
      second is that a single-page flush is inefficient in terms of IO.  While
      there is an expectation that the elevator will merge requests, this does
      not always happen.  Quoting Christoph Hellwig;
      
      	The elevator has a relatively small window it can operate on,
      	and can never fix up a bad large scale writeback pattern.
      
      This patch prevents direct reclaim writing back filesystem pages by
      checking if current is kswapd.  Anonymous pages are still written to swap
      as there is not the equivalent of a flusher thread for anonymous pages.
      If the dirty pages cannot be written back, they are placed back on the LRU
      lists.  There is now a direct dependency on dirty page balancing to
      prevent too many pages in the system being dirtied which would prevent
      reclaim making forward progress.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee72886d
    • Christoph Lameter's avatar
      mm: add comments to explain mm_struct fields · e10d59f2
      Christoph Lameter authored
      Add comments to explain the page statistics field in the mm_struct.
      
      [akpm@linux-foundation.org: add missing ;]
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e10d59f2
    • Christoph Lameter's avatar
      mm: distinguish between mlocked and pinned pages · bc3e53f6
      Christoph Lameter authored
      Some kernel components pin user space memory (infiniband and perf) (by
      increasing the page count) and account that memory as "mlocked".
      
      The difference between mlocking and pinning is:
      
      A. mlocked pages are marked with PG_mlocked and are exempt from
         swapping. Page migration may move them around though.
         They are kept on a special LRU list.
      
      B. Pinned pages cannot be moved because something needs to
         directly access physical memory. They may not be on any
         LRU list.
      
      I recently saw an mlockalled process where mm->locked_vm became
      bigger than the virtual size of the process (!) because some
      memory was accounted for twice:
      
      Once when the page was mlocked and once when the Infiniband
      layer increased the refcount because it needt to pin the RDMA
      memory.
      
      This patch introduces a separate counter for pinned pages and
      accounts them seperately.
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Mike Marciniszyn <infinipath@qlogic.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc3e53f6
    • Johannes Weiner's avatar
      mm: vmscan: drop nr_force_scan[] from get_scan_count · f11c0ca5
      Johannes Weiner authored
      The nr_force_scan[] tuple holds the effective scan numbers for anon and
      file pages in case the situation called for a forced scan and the
      regularly calculated scan numbers turned out zero.
      
      However, the effective scan number can always be assumed to be
      SWAP_CLUSTER_MAX right before the division into anon and file.  The
      numerators and denominator are properly set up for all cases, be it force
      scan for just file, just anon, or both, to do the right thing.
      Signed-off-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f11c0ca5
    • Dave Jones's avatar
      mm: output a list of loaded modules when we hit bad_page() · 4f31888c
      Dave Jones authored
      When we get a bad_page bug report, it's useful to see what modules the
      user had loaded.
      Signed-off-by: default avatarDave Jones <davej@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f31888c
    • Robert P. J. Day's avatar
      tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious. · f5fc870d
      Robert P. J. Day authored
      Add the leading word "tmpfs" to the Kconfig string to make it blindingly
      obvious that this selection refers to tmpfs.
      Signed-off-by: default avatarRobert P. J. Day <rpjday@crashcourse.ca>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f5fc870d
    • David Rientjes's avatar
      oom: fix race while temporarily setting current's oom_score_adj · 43362a49
      David Rientjes authored
      test_set_oom_score_adj() was introduced in 72788c38 ("oom: replace
      PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
      current's oom_score_adj for ksm and swapoff without requiring an
      additional per-process flag.
      
      Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
      then reinstate the previous value is racy since it's possible that
      userspace can set the value to something else itself before the old value
      is reinstated.  That results in userspace setting current's oom_score_adj
      to a different value and then the kernel immediately setting it back to
      its previous value without notification.
      
      To fix this, a new compare_swap_oom_score_adj() function is introduced
      with the same semantics as the compare and swap CAS instruction, or
      CMPXCHG on x86.  It is used to reinstate the previous value of
      oom_score_adj if and only if the present value is the same as the old
      value.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43362a49
    • David Rientjes's avatar
      oom: remove oom_disable_count · c9f01245
      David Rientjes authored
      This removes mm->oom_disable_count entirely since it's unnecessary and
      currently buggy.  The counter was intended to be per-process but it's
      currently decremented in the exit path for each thread that exits, causing
      it to underflow.
      
      The count was originally intended to prevent oom killing threads that
      share memory with threads that cannot be killed since it doesn't lead to
      future memory freeing.  The counter could be fixed to represent all
      threads sharing the same mm, but it's better to remove the count since:
      
       - it is possible that the OOM_DISABLE thread sharing memory with the
         victim is waiting on that thread to exit and will actually cause
         future memory freeing, and
      
       - there is no guarantee that a thread is disabled from oom killing just
         because another thread sharing its mm is oom disabled.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9f01245
    • David Rientjes's avatar
      oom: avoid killing kthreads if they assume the oom killed thread's mm · 7b0d44fa
      David Rientjes authored
      After selecting a task to kill, the oom killer iterates all processes and
      kills all other threads that share the same mm_struct in different thread
      groups.  It would not otherwise be helpful to kill a thread if its memory
      would not be subsequently freed.
      
      A kernel thread, however, may assume a user thread's mm by using
      use_mm().  This is only temporary and should not result in sending a
      SIGKILL to that kthread.
      
      This patch ensures that only user threads and not kthreads are sent a
      SIGKILL if they share the same mm_struct as the oom killed task.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b0d44fa
    • David Rientjes's avatar
      oom: thaw threads if oom killed thread is frozen before deferring · f660daac
      David Rientjes authored
      If a thread has been oom killed and is frozen, thaw it before returning to
      the page allocator.  Otherwise, it can stay frozen indefinitely and no
      memory will be freed.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f660daac
    • Johannes Weiner's avatar
      mm/page-writeback.c: document bdi_min_ratio · d08c429b
      Johannes Weiner authored
      Looks like someone got distracted after adding the comment characters.
      Signed-off-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d08c429b
    • Shaohua Li's avatar
      vmscan: add block plug for page reclaim · 3da367c3
      Shaohua Li authored
      per-task block plug can reduce block queue lock contention and increase
      request merge.  Currently page reclaim doesn't support it.  I originally
      thought page reclaim doesn't need it, because kswapd thread count is
      limited and file cache write is done at flusher mostly.
      
      When I test a workload with heavy swap in a 4-node machine, each CPU is
      doing direct page reclaim and swap.  This causes block queue lock
      contention.  In my test, without below patch, the CPU utilization is about
      2% ~ 7%.  With the patch, the CPU utilization is about 1% ~ 3%.  Disk
      throughput isn't changed.  This should improve normal kswapd write and
      file cache write too (increase request merge for example), but might not
      be so obvious as I explain above.
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3da367c3
    • Hugh Dickins's avatar
      radix_tree: clean away saw_unset_tag leftovers · 3fa36acb
      Hugh Dickins authored
      radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was
      unsafe and removed in 2.6.34, but the pointless saw_unset_tag left behind.
      
      Remove it now, and return 0 as soon as we see unset tag - we already rely
      upon the root tag to be correct, returning 0 immediately if it's not set.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3fa36acb
    • Minchan Kim's avatar
      mm: migration: clean up unmap_and_move() · 0dabec93
      Minchan Kim authored
      unmap_and_move() is one a big messy function.  Clean it up.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0dabec93
    • Minchan Kim's avatar
      mm: zone_reclaim: make isolate_lru_page() filter-aware · f80c0673
      Minchan Kim authored
      In __zone_reclaim case, we don't want to shrink mapped page.  Nonetheless,
      we have isolated mapped page and re-add it into LRU's head.  It's
      unnecessary CPU overhead and makes LRU churning.
      
      Of course, when we isolate the page, the page might be mapped but when we
      try to migrate the page, the page would be not mapped.  So it could be
      migrated.  But race is rare and although it happens, it's no big deal.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f80c0673
    • Minchan Kim's avatar
      mm: compaction: make isolate_lru_page() filter-aware · 39deaf85
      Minchan Kim authored
      In async mode, compaction doesn't migrate dirty or writeback pages.  So,
      it's meaningless to pick the page and re-add it to lru list.
      
      Of course, when we isolate the page in compaction, the page might be dirty
      or writeback but when we try to migrate the page, the page would be not
      dirty, writeback.  So it could be migrated.  But it's very unlikely as
      isolate and migration cycle is much faster than writeout.
      
      So, this patch helps cpu overhead and prevent unnecessary LRU churning.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39deaf85
    • Minchan Kim's avatar
      mm: change isolate mode from #define to bitwise type · 4356f21d
      Minchan Kim authored
      Change ISOLATE_XXX macro with bitwise isolate_mode_t type.  Normally,
      macro isn't recommended as it's type-unsafe and making debugging harder as
      symbol cannot be passed throught to the debugger.
      
      Quote from Johannes
      " Hmm, it would probably be cleaner to fully convert the isolation mode
      into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
      tri-state among flags, which is a bit ugly."
      
      This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4356f21d
    • Minchan Kim's avatar
      mm: compaction: trivial clean up in acct_isolated() · b9e84ac1
      Minchan Kim authored
      acct_isolated of compaction uses page_lru_base_type which returns only
      base type of LRU list so it never returns LRU_ACTIVE_ANON or
      LRU_ACTIVE_FILE.  In addtion, cc->nr_[anon|file] is used in only
      acct_isolated so it doesn't have fields in conpact_control.
      
      This patch removes fields from compact_control and makes clear function of
      acct_issolated which counts the number of anon|file pages isolated.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9e84ac1
    • Christopher Yeoh's avatar
      Cross Memory Attach · fcf63409
      Christopher Yeoh authored
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      copying.
      
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: default avatarChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcf63409
    • Wanlong Gao's avatar
      ipc/mqueue.c: fix wrong use of schedule_hrtimeout_range_clock() · 32ea845d
      Wanlong Gao authored
      Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(),
      although it is harmless for the syscall mq_timed* now.  It was introduced
      by 9ca7d8e6 ("mqueue: Convert message queue timeout to use hrtimers").
      Signed-off-by: default avatarWanlong Gao <gaowanlong@cn.fujitsu.com>
      Cc: Carsten Emde <C.Emde@osadl.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      32ea845d