1. 11 Sep, 2013 40 commits
    • mm/hotplug: verify hotplug memory range · 27356f54
      Toshi Kani authored
      add_memory() and remove_memory() can only handle a memory range aligned
      with section.  There are problems when an unaligned range is added and
      then deleted as follows:
      
       - add_memory() with an unaligned range succeeds, but __add_pages()
         called from add_memory() adds a whole section of pages even though
         a given memory range is less than the section size.
       - remove_memory() on the added unaligned range hits BUG_ON() in
         __remove_pages().
      
      This patch changes add_memory() and remove_memory() to check up front
      whether a given memory range is section-aligned.  As a result,
      add_memory() fails with -EINVAL when a given range is unaligned, and does
      not add such a memory range.  This also prevents remove_memory() from
      being called with an unaligned range.  Note that remove_memory() has to
      use BUG_ON() since this function cannot fail.
      
      [akpm@linux-foundation.org: avoid printk warnings]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
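The section-alignment check described in the hotplug commit above can be sketched in userspace C. This is only an illustration: the helper name `check_section_aligned` is hypothetical, and the 128 MiB section size is an assumption (it matches `SECTION_SIZE_BITS = 27` on x86_64); the kernel's own check works on page frame numbers rather than byte addresses.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: 128 MiB sections (SECTION_SIZE_BITS = 27, as on x86_64). */
#define SECTION_SIZE_BITS 27
#define SECTION_SIZE (UINT64_C(1) << SECTION_SIZE_BITS)

/*
 * Hypothetical helper: returns 0 when [start, start + size) is non-empty
 * and both its start and its length are multiples of the section size,
 * and -EINVAL otherwise.
 */
static int check_section_aligned(uint64_t start, uint64_t size)
{
	if (size == 0)
		return -EINVAL;
	if ((start & (SECTION_SIZE - 1)) || (size & (SECTION_SIZE - 1)))
		return -EINVAL;
	return 0;
}
```

A caller that can fail, like add_memory(), would propagate the -EINVAL; a caller that cannot fail, like remove_memory(), can only assert on the result.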
    • hugepage: mention libhugetlbfs in doc · 15610c86
      Davidlohr Bueso authored
      Explicitly mention/recommend using the libhugetlbfs test cases when
      changing related kernel code.  Developers that are unaware of the project
      can easily miss this and introduce potential regressions that may or may
      not be caught by community review.
      
      Also do some cleanups that make the document visually easier to view at a
      first glance.
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: make context readahead more conservative · 2cad4018
      Fengguang Wu authored
      This helps performance on moderately dense random reads on SSD.
      
      Transaction-Per-Second numbers provided by Taobao:
      
      		QPS	case
      		-------------------------------------------------------
      		7536	disable context readahead totally
      w/ patch:	7129	slower size rampup and start RA on the 3rd read
      		6717	slower size rampup
      w/o patch:	5581	unmodified context readahead
      
      Before, readahead was started whenever page N+1 was read shortly after
      page N.  After this patch, we only start readahead when *three* random
      reads happen to access pages N, N+1, N+2.  The probability of this
      happening is extremely low for pure random reads, unless they are very
      dense, in which case some readahead is actually deserved.
      
      Also start with a smaller readahead window.  The impact on interleaved
      sequential reads should be small, because for a long-running stream the
      small readahead window rampup phase is negligible.
      
      The context readahead actually benefits clustered random reads on HDD
      whose seek cost is pretty high.  However as SSD is increasingly used for
      random read workloads it's better for the context readahead to concentrate
      on interleaved sequential reads.
      
      Another SSD rand read test from Miao
      
              # file size:        2GB
              # read IO amount: 625MB
              sysbench --test=fileio          \
                      --max-requests=10000    \
                      --num-threads=1         \
                      --file-num=1            \
                      --file-block-size=64K   \
                      --file-test-mode=rndrd  \
                      --file-fsync-freq=0     \
                      --file-fsync-end=off    run
      
      shows btrfs throughput improving from 69MB/s to 121MB/s, and ext4 from
      104MB/s to 121MB/s.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Tested-by: Tao Ma <tm@tao.ma>
      Tested-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
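The "start readahead only on the third adjacent read" rule from the readahead commit above can be modelled with a toy page-cache bitmap. This is a simplified sketch, not the kernel code: `count_history` and `should_start_context_readahead` are hypothetical names, and the array of booleans stands in for the real page cache lookup.

```c
#include <stdbool.h>

/*
 * Count how many consecutive cached pages immediately precede `offset`,
 * looking back at most `max` pages. `cached[i]` is true when page i is
 * already in the (toy) page cache.
 */
static unsigned long count_history(const bool *cached, unsigned long offset,
				   unsigned long max)
{
	unsigned long n = 0;

	while (n < max && offset > n && cached[offset - 1 - n])
		n++;
	return n;
}

/*
 * Assumed policy: context readahead starts only when the history is longer
 * than the current request. For single-page reads this means pages N-2 and
 * N-1 must both be cached before the read of page N triggers readahead.
 */
static bool should_start_context_readahead(const bool *cached,
					    unsigned long offset,
					    unsigned long req_size,
					    unsigned long max)
{
	return count_history(cached, offset, max) > req_size;
}
```

With this rule a lone pair of adjacent random reads no longer triggers readahead; only a third adjacent read does.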
    • mm: use zone_is_initialized() instead of if(zone->wait_table) · 139c2d75
      Xishi Qiu authored
      Use "zone_is_initialized()" instead of "if (zone->wait_table)".
      Simplify the code, no functional change.
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use zone_is_empty() instead of if(zone->spanned_pages) · 8080fc03
      Xishi Qiu authored
      Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
      Simplify the code, no functional change.
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use zone_end_pfn() instead of zone_start_pfn+spanned_pages · c33bc315
      Xishi Qiu authored
      Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages".
      Simplify the code, no functional change.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
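The three zone helpers named in the commits above wrap the open-coded expressions they replace. A minimal userspace sketch follows; `struct zone_sketch` is a hypothetical, cut-down stand-in for the kernel's `struct zone` that keeps only the three fields these helpers read.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the kernel's struct zone. */
struct zone_sketch {
	unsigned long zone_start_pfn;	/* first page frame number of the zone */
	unsigned long spanned_pages;	/* pages spanned, including holes */
	void *wait_table;		/* non-NULL once the zone is set up */
};

/* Replaces "zone->zone_start_pfn + zone->spanned_pages". */
static inline unsigned long zone_end_pfn(const struct zone_sketch *zone)
{
	return zone->zone_start_pfn + zone->spanned_pages;
}

/* Replaces "if (zone->wait_table)". */
static inline bool zone_is_initialized(const struct zone_sketch *zone)
{
	return zone->wait_table != NULL;
}

/* Replaces the test on "zone->spanned_pages". */
static inline bool zone_is_empty(const struct zone_sketch *zone)
{
	return zone->spanned_pages == 0;
}
```

Note that zone_end_pfn() returns an exclusive end: the first page frame number past the zone.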
    • lib/genalloc.c: fix overflow of ending address of memory chunk · 674470d9
      Joonyoung Shim authored
      In struct gen_pool_chunk, end_addr means the end address of the memory
      chunk (inclusive), but the implementation treats it as start address +
      size of the memory chunk (exclusive), so it points one past the correct
      ending address.
      
      The ending address plus one overflows for a memory chunk that includes
      the last address of the memory map, e.g. when the starting address is
      0xFFF00000 and the size is 0x100000 on a 32-bit machine, the ending
      address would be 0x100000000.
      
      Use the correct ending address, i.e. starting address + size - 1.
      
      [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr]
      Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
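The overflow described in the genalloc commit above can be reproduced with fixed-width arithmetic. In this sketch `uint32_t` stands in for a 32-bit `unsigned long`, and the function names are hypothetical.

```c
#include <stdint.h>

/* Exclusive end: one past the last byte. Wraps when the chunk ends at 4 GiB. */
static uint32_t chunk_end_exclusive(uint32_t start, uint32_t size)
{
	return start + size;
}

/* Inclusive end: address of the last byte. Always representable for size > 0. */
static uint32_t chunk_end_inclusive(uint32_t start, uint32_t size)
{
	return start + size - 1;
}

/* Containment test written against the inclusive end address. */
static int chunk_contains(uint32_t start, uint32_t size, uint32_t addr)
{
	return addr >= start && addr <= chunk_end_inclusive(start, size);
}
```

For start 0xFFF00000 and size 0x100000, the exclusive end wraps to 0, so any "addr < end" comparison fails for every address in the chunk, while the inclusive end is 0xFFFFFFFF and compares correctly.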
    • mm/zbud: fix some trivial typos in comments · eee87e17
      Jianguo Wu authored
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hotplug: remove unnecessary BUG_ON in __offline_pages() · 37b000b6
      Xishi Qiu authored
      I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(),
      because in memory_block_action() "nr_pages = PAGES_PER_SECTION * sections_per_block"
      is always greater than 0.
      
      memory_block_action()
      	offline_pages()
      		__offline_pages()
      			BUG_ON(start_pfn >= end_pfn)
      
      In v2.6.32, if info->length == 0, the following path could hit this
      BUG_ON():
      acpi_memory_disable_device()
      	remove_memory(info->start_addr, info->length)
      			offline_pages()
      
      A later Fujitsu patch renamed this function, and the BUG_ON() is now
      unnecessary.
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmalloc: use well-defined find_last_bit() func · b136be5e
      Joonsoo Kim authored
      Our intention here is to find the last bit within the region to flush.
      There is a well-defined function, find_last_bit(), for this purpose, and
      its performance may be slightly better than the current implementation.
      So use it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Joonsoo Kim's avatar
      6b70f7df
    • vmstat: use this_cpu() to avoid irqon/off sequence in refresh_cpu_vm_stats · fbc2edb0
      Christoph Lameter authored
      Disabling interrupts repeatedly can be avoided in the inner loop if we use
      a this_cpu operation.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: create fold_diff · 4edb0748
      Christoph Lameter authored
      Both functions that update global counters use the same mechanism.
      
      Create a function that contains the common code.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
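The common step factored out by the fold_diff commit above is "add each non-zero per-cpu delta to the matching global counter". The following is a single-threaded sketch under stated assumptions: `NR_STAT_ITEMS` and `global_stat` are hypothetical names, and plain additions stand in for the atomic updates a real multi-CPU implementation would need.

```c
/* Assumption: a small, fixed number of counters for illustration. */
#define NR_STAT_ITEMS 4

/* Toy global counters; real code would update these atomically. */
static long global_stat[NR_STAT_ITEMS];

/*
 * Fold an array of accumulated per-cpu deltas into the global counters,
 * skipping zero entries so untouched counters are not written at all.
 */
static void fold_diff(const int *diff)
{
	int i;

	for (i = 0; i < NR_STAT_ITEMS; i++)
		if (diff[i])
			global_stat[i] += diff[i];
}
```

Skipping zero deltas matters on real hardware because each global update is a write to a shared cache line.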
    • vmstat: create separate function to fold per cpu diffs into local counters · 2bb921e5
      Christoph Lameter authored
      The main idea behind this patchset is to reduce the vmstat update overhead
      by avoiding interrupt enable/disable and the use of per cpu atomics.
      
      This patch (of 3):
      
      It is better to have a separate folding function because
      refresh_cpu_vm_stats() also does other things like expire pages in the
      page allocator caches.
      
      If we have a separate function then refresh_cpu_vm_stats() is only called
      from the local cpu which allows additional optimizations.
      
      The folding function is only called when a cpu is being downed and
      therefore no other processor will be accessing the counters.  This also
      simplifies synchronization.
      
      [akpm@linux-foundation.org: fix UP build]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: clean-up #ifdef in page_mapping() · d2cf5ad6
      Joonsoo Kim authored
      PageSwapCache() is always false when !CONFIG_SWAP, so the compiler
      properly discards the related code.  Therefore, we don't need an
      explicit #ifdef.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move pgtable related functions to right place · bc4b4448
      Joonsoo Kim authored
      pgtable related functions are mostly in pgtable-generic.c.
      So move remaining functions from memory.c to pgtable-generic.c.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: add unlikely macro to help compiler optimization · e66f0972
      Joonsoo Kim authored
      We rarely allocate a page with ALLOC_NO_WATERMARKS, and it is used in the
      slow path.  To help compiler optimization, add the unlikely macro to the
      ALLOC_NO_WATERMARKS check.
      
      This patch has no effect now, because gcc already optimizes this
      properly.  But we cannot assume that gcc always gets it right, and nobody
      re-evaluates whether gcc still optimizes properly after their change; for
      example, it is not optimized properly on v3.10.  So adding a compiler
      hint here is reasonable.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
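The compiler hint discussed in the page_alloc commit above is a thin wrapper around a GCC/Clang builtin. This sketch shows the general pattern only: `__builtin_expect` is the real builtin, but the flag value and the `needs_watermark_check` helper are illustrative assumptions, not the patched kernel code.

```c
/* Branch hints as commonly defined for GCC and Clang. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Assumption: illustrative flag bit, not taken from the patch. */
#define ALLOC_NO_WATERMARKS 0x04

/*
 * Hypothetical helper: returns 1 when the normal watermark check should
 * run, 0 when the (rare) ALLOC_NO_WATERMARKS flag bypasses it. The hint
 * tells the compiler to lay out the bypass as the cold path.
 */
static int needs_watermark_check(int alloc_flags)
{
	if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
		return 0;
	return 1;
}
```

The hint never changes the result of the condition; it only influences code layout and branch prediction defaults.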
    • mm/mempolicy: return NULL if node is NUMA_NO_NODE in get_task_policy · 1da6f0e1
      Jianguo Wu authored
      If node == NUMA_NO_NODE, pol is NULL, so we should return NULL instead of
      doing the "if (!pol->mode)" check.
      
      [akpm@linux-foundation.org: reorganise code]
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache · af0ed73e
      Joonsoo Kim authored
      If a vma with VM_NORESERVE allocates a new page for the page cache, we
      should check whether this area is reserved or not.  If this address is
      already reserved by another process (the chg == 0 case), we should
      decrement the reserve count, because the allocated page will go into the
      page cache and there is currently no way to know, when releasing the
      inode, whether this page came from the reserved pool.  Otherwise the
      reserve count may be over-counted.  The following example code easily
      reproduces this situation.
      
      Assume 2MB huge pages, nr_hugepages = 100
      
              size = 20 * MB;
              flag = MAP_SHARED;
              p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
              if (p == MAP_FAILED) {
                      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                      return -1;
              }
      
              flag = MAP_SHARED | MAP_NORESERVE;
              q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
              if (q == MAP_FAILED) {
                      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                      return -1;
              }
              q[0] = 'c';
      
      After the program finishes, run 'cat /proc/meminfo'.  You will see the
      following result.
      
      HugePages_Free:      100
      HugePages_Rsvd:        1
      
      To fix this, we should check our mapping type and tracked region.  If our
      mapping is VM_NORESERVE and VM_MAYSHARE and chg is 0, the currently
      allocated page will go into the page cache of a region that was already
      reserved when the mapping was created.  In this case, we should decrease
      the reserve count.  This patch implements the above and solves the
      problem.
      
      [akpm@linux-foundation.org: fix spelling in comment]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: remove decrement_hugepage_resv_vma() · a63884e9
      Joonsoo Kim authored
      Now the checking condition of decrement_hugepage_resv_vma() and
      vma_has_reserves() is the same, so we can clean up this function with
      vma_has_reserves().  Additionally, decrement_hugepage_resv_vma() has only
      one call site, so we can remove the function and embed it into
      dequeue_huge_page_vma() directly.  This patch implements that.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: add VM_NORESERVE check in vma_has_reserves() · 72231b03
      Joonsoo Kim authored
      If we map the region with MAP_NORESERVE and MAP_SHARED, we can skip the
      reserve-count check, and consequently we cannot be guaranteed to allocate
      a huge page at fault time.  The following example code easily
      demonstrates this situation.
      
      Assume 2MB huge pages, nr_hugepages = 100
      
              fd = hugetlbfs_unlinked_fd();
              if (fd < 0)
                      return 1;
      
              size = 200 * MB;
              flag = MAP_SHARED;
              p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
              if (p == MAP_FAILED) {
                      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                      return -1;
              }
      
              size = 2 * MB;
              flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE;
              p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);
              if (p == MAP_FAILED) {
                      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                      return -1;
              }
              p[0] = '0';
              sleep(10);
      
      While sleep(10) is executing, run 'cat /proc/meminfo' in another process.
      
      HugePages_Free:       99
      HugePages_Rsvd:      100
      
      The number of free pages should be higher than or equal to the number of
      reserved pages, but here it is not.  This shows that a non-reserved
      shared mapping stole a reserved page.  A non-reserved shared mapping
      should not eat into reserve space.
      
      If we consider VM_NORESERVE in vma_has_reserves() and return 0, meaning
      that we don't have reserved pages, then we check that we have enough free
      pages in dequeue_huge_page_vma().  This prevents stealing a reserved page.
      
      With this change, the above test generates a SIGBUS, which is correct,
      because all free pages are reserved and a non-reserved shared mapping
      can't get a free page.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: do not use a page in page cache for cow optimization · 37a2140d
      Joonsoo Kim authored
      Currently, we use a page with mapped count 1 in the page cache for the
      cow optimization.  If we find this condition, we don't allocate a new
      page and copy the contents.  Instead, we map this page directly.  This
      may introduce a problem: writing to a private mapping overwrites the
      hugetlb file directly.  You can reproduce this situation with the
      following code.
      
              size = 20 * MB;
              flag = MAP_SHARED;
              p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
              if (p == MAP_FAILED) {
                      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                      return -1;
              }
              p[0] = 's';
              fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
              munmap(p, size);
      
              flag = MAP_PRIVATE;
              p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
              if (p == MAP_FAILED) {
                      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                      return -1;
              }
              p[0] = 'c';
              munmap(p, size);
      
              flag = MAP_SHARED;
              p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
              if (p == MAP_FAILED) {
                      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                      return -1;
              }
              fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
              munmap(p, size);
      
      We see "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL PRIVATE
      WRITE: s".  If we turn off this optimization for a page in the page
      cache, the problem disappears.
      
      So, I change the trigger condition of the optimization.  If this page is
      not an AnonPage, we don't do the optimization.  This turns the
      optimization off for page cache pages.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: remove redundant list_empty check in gather_surplus_pages() · c0d934ba
      Joonsoo Kim authored
      If the list is empty, list_for_each_entry_safe() doesn't do anything.
      So, this check is redundant.  Remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: fix and clean-up node iteration code to alloc or free · b2261026
      Joonsoo Kim authored
      The current node iteration code has a minor problem: it does one more
      node rotation if we fail to allocate.  For example, if we start to
      allocate at node 0, we stop iterating at node 0.  Then we start to
      allocate at node 1 for the next allocation.
      
      I introduce new macros "for_each_node_mask_to_[alloc|free]" and fix and
      clean up the node iteration code for alloc and free.  This makes the
      code more understandable.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: clean-up alloc_huge_page() · 81a6fcae
      Joonsoo Kim authored
      Unify successful allocation paths to make the code more readable.  There
      are no functional changes.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: trivial commenting fix · c748c262
      Joonsoo Kim authored
      The name of the mutex written in comment is wrong.  Fix it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: move up the code which check availability of free huge page · 9966c4bb
      Joonsoo Kim authored
      At this point we are holding hugetlb_lock, so hstate values can't
      change.  If we don't have any usable free huge page at this point, we
      don't need to proceed with the processing.  So move this code up.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9966c4bb
    • Johannes Weiner's avatar
      mm: revert "page-writeback.c: subtract min_free_kbytes from dirtyable memory" · 72457c0a
      Johannes Weiner authored
      This reverts commit 75f7ad8e.  It was the result of a problem
      observed with a 3.2 kernel and merged in 3.9, while the issue had been
      resolved upstream in 3.3 (commit ab8fabd4: "mm: exclude reserved
      pages from dirtyable memory").
      
      The "reserved pages" are a superset of min_free_kbytes, thus this change
      is redundant and confusing.  Revert it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Paul Szabo <psz@maths.usyd.edu.au>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72457c0a
    • Johannes Weiner's avatar
      mm: page_alloc: fair zone allocator policy · 81c0a2bb
      Johannes Weiner authored
      Each zone that holds userspace pages of one workload must be aged at a
      speed proportional to the zone size.  Otherwise, the time an individual
      page gets to stay in memory depends on the zone it happened to be
      allocated in.  Asymmetry in the zone aging creates rather unpredictable
      aging behavior and results in the wrong pages being reclaimed, activated
      etc.
      
      But exactly this happens right now because of the way the page allocator
      and kswapd interact.  The page allocator uses per-node lists of all zones
      in the system, ordered by preference, when allocating a new page.  When
      the first iteration does not yield any results, kswapd is woken up and the
      allocator retries.  Due to the way kswapd reclaims zones below the high
      watermark while a zone can be allocated from when it is above the low
      watermark, the allocator may keep kswapd running while kswapd reclaim
      ensures that the page allocator can keep allocating from the first zone in
      the zonelist for extended periods of time.  Meanwhile the other zones
      rarely see new allocations and thus get aged much slower in comparison.
      
      The result is that the occasional page placed in lower zones gets
      relatively more time in memory, even gets promoted to the active list
      after its peers have long been evicted.  Meanwhile, the bulk of the
      working set may be thrashing on the preferred zone even though there may
      be significant amounts of memory available in the lower zones.
      
      Even the most basic test -- repeatedly reading a file slightly bigger than
      memory -- shows how broken the zone aging is.  In this scenario, no single
      page should be able to stay in memory long enough to get referenced twice
      and activated, but activation happens in spades:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 0
            nr_active_file 8
            nr_inactive_file 1582
            nr_active_file 11994
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 70
            nr_inactive_file 258753
            nr_active_file 443214
            nr_inactive_file 149793
            nr_active_file 12021
      
      Fix this with a very simple round robin allocator.  Each zone is allowed a
      batch of allocations that is proportional to the zone's size, after which
      it is treated as full.  The batch counters are reset when all zones have
      been tried and the allocator enters the slowpath and kicks off kswapd
      reclaim.  Allocation and reclaim are now fairly spread out to all
      available/allowable zones:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 174
            nr_active_file 4865
            nr_inactive_file 53
            nr_active_file 860
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 666622
            nr_active_file 4988
            nr_inactive_file 190969
            nr_active_file 937
      
      When zone_reclaim_mode is enabled, allocations will now spread out to all
      zones on the local node, not just the first preferred zone (which on a 4G
      node might be a tiny Normal zone).
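
      The round robin scheme described above can be sketched in user-space C
      roughly as follows.  This is only an illustration: struct zone_sketch,
      reset_batches(), alloc_page_fair() and the 1/1024 batch ratio are
      hypothetical names and values chosen for the example, not the kernel's
      actual data structures or batch size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zone: only the fields this sketch needs. */
struct zone_sketch {
    long managed_pages; /* zone size in pages */
    long alloc_batch;   /* allocations left in the current round */
};

/* Refill each zone's batch in proportion to its size (ratio is assumed). */
static void reset_batches(struct zone_sketch *zones, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        zones[i].alloc_batch = zones[i].managed_pages / 1024;
}

/*
 * Walk the zones in preference order.  A zone whose batch is used up is
 * treated as full and skipped.  Returns the index of the zone allocated
 * from, or -1 when every batch is exhausted; that is the point where the
 * commit message says the allocator enters the slowpath, kicks off kswapd
 * and resets the batches.
 */
static int alloc_page_fair(struct zone_sketch *zones, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        if (zones[i].alloc_batch <= 0)
            continue;
        zones[i].alloc_batch--;
        return (int)i;
    }
    return -1;
}
```

      With two zones of 4096 and 2048 pages, one round in this sketch hands
      out allocations in a 4:2 ratio before a reset is needed, which is the
      size-proportional spreading the patch is after.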
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Tested-by: Kevin Hilman <khilman@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81c0a2bb
    • Johannes Weiner's avatar
      mm: page_alloc: rearrange watermark checking in get_page_from_freelist · e085dbc5
      Johannes Weiner authored
      Allocations that do not have to respect the watermarks are rare
      high-priority events.  Reorder the code such that per-zone dirty limits
      and future checks important only to regular page allocations are ignored
      in these extraordinary situations.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e085dbc5
    • Johannes Weiner's avatar
      mm: vmscan: fix numa reclaim balance problem in kswapd · 892f795d
      Johannes Weiner authored
      The way the page allocator interacts with kswapd creates aging imbalances,
      where the amount of time a userspace page gets in memory under reclaim
      pressure is dependent on which zone, which node the allocator took the
      page frame from.
      
      #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
         nodes falling behind for a full reclaim cycle relative to the other
         nodes in the system
      
      #3 fixes an interaction where kswapd and a continuous stream of page
         allocations keep the preferred zone of a task between the high and
         low watermark (allocations succeed + kswapd does not go to sleep)
         indefinitely, completely underutilizing the lower zones and
         thrashing on the preferred zone
      
      These patches are the aging fairness part of the thrash-detection based
      file LRU balancing.  Andrea recommended submitting them separately as they
      are bugfixes in their own right.
      
      The following test ran a foreground workload (memcachetest) with
      background IO of various sizes on a 4 node 8G system (similar results were
      observed with single-node 4G systems):
      
      parallelio
                                                    BASE                   FAIRALLOC
      Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
      Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
      Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
      Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
      Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
      Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
      Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
      Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
      Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
      Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
      Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
      Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
      Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
      Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
      Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)
      
                      BASE   FAIRALLOC
      User          287.78      460.97
      System       2151.67     3142.51
      Elapsed      9737.00     8879.34
      
                                        BASE   FAIRALLOC
      Minor Faults                  53721925    57188551
      Major Faults                    392195       15157
      Swap Ins                       2994854      112770
      Swap Outs                      4907092      134982
      Direct pages scanned                 0       41824
      Kswapd pages scanned          32975063     8128269
      Kswapd pages reclaimed         6323069     7093495
      Direct pages reclaimed               0       41824
      Kswapd efficiency                  19%         87%
      Kswapd velocity               3386.573     915.414
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       4.710
      Percentage direct scans             0%          0%
      Zone normal velocity          2011.338     550.661
      Zone dma32 velocity           1365.623     369.221
      Zone dma velocity                9.612       0.242
      Page writes by reclaim    18732404.000  614807.000
      Page writes file              13825312      479825
      Page writes anon               4907092      134982
      Page reclaim immediate           85490        5647
      Sector Reads                  12080532      483244
      Sector Writes                 88740508    65438876
      Page rescued immediate               0           0
      Slabs scanned                    82560       12160
      Direct inode steals                  0           0
      Kswapd inode steals              24401       40013
      Kswapd skipped wait                  0           0
      THP fault alloc                      6           8
      THP collapse alloc                5481        5812
      THP splits                          75          22
      THP fault fallback                   0           0
      THP collapse fail                    0           0
      Compaction stalls                    0          54
      Compaction success                   0          45
      Compaction failures                  0           9
      Page migrate success            881492       82278
      Page migrate failure                 0           0
      Compaction pages isolated            0       60334
      Compaction migrate scanned           0       53505
      Compaction free scanned              0     1537605
      Compaction cost                    914          86
      NUMA PTE updates              46738231    41988419
      NUMA hint faults              31175564    24213387
      NUMA hint local faults        10427393     6411593
      NUMA pages migrated             881492       55344
      AutoNUMA cost                   156221      121361
      
      The overall runtime was reduced, and throughput improved for both the
      foreground workload and the background IO.  Major faults, swapping and
      reclaim activity shrank significantly, and reclaim efficiency more than
      quadrupled.
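
      The derived rows in the table appear to follow from the raw counters.
      The sketch below assumes that "efficiency" is pages reclaimed divided by
      pages scanned, and that "velocity" is pages scanned divided by the
      elapsed time; these definitions are inferred from the numbers above, and
      the function names are made up for the example.

```c
#include <assert.h>

/*
 * Percentage of scanned pages that were reclaimed.  Reports 100% when
 * nothing was scanned, matching the "Direct efficiency" row for BASE.
 */
static double reclaim_efficiency(double reclaimed, double scanned)
{
    return scanned > 0 ? 100.0 * reclaimed / scanned : 100.0;
}

/* Pages scanned per unit of elapsed test time. */
static double scan_velocity(double scanned, double elapsed)
{
    return scanned / elapsed;
}
```

      Feeding in the kswapd counters gives roughly 19% for BASE
      (6323069 / 32975063) and 87% for FAIRALLOC (7093495 / 8128269), which is
      where the "more than quadrupled" claim comes from.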
      
      This patch:
      
      When the page allocator fails to get a page from all zones in its given
      zonelist, it wakes up the per-node kswapds for all zones that are at their
      low watermark.
      
      However, with a system under load the free pages in a zone can fluctuate
      enough that the allocation fails but the kswapd wakeup is also skipped
      while the zone is still really close to the low watermark.
      
      When one node misses a wakeup like this, it won't be aged before all the
      other nodes' zones are down to their low watermarks again.  And skipping a
      full aging cycle is an obvious fairness problem.
      
      Kswapd runs until the high watermarks are restored, so it should also be
      woken when the high watermarks are not met.  This ages nodes more equally
      and creates a safety margin for the page counter fluctuation.
      
      By using zone_balanced(), it will now check, in addition to the watermark,
      if compaction requires more order-0 pages to create a higher order page.
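
      The change in the wakeup condition can be illustrated with a minimal
      sketch.  struct zone_marks and both helpers are hypothetical
      simplifications; the compaction check done by zone_balanced() is left
      out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a zone's free page count and watermarks. */
struct zone_marks {
    long free_pages;
    long low_wmark;
    long high_wmark;
};

/* Old behavior: wake kswapd only for zones at or below the low watermark. */
static bool should_wake_low(const struct zone_marks *z)
{
    return z->free_pages <= z->low_wmark;
}

/* New behavior: wake kswapd whenever the high watermark is not met. */
static bool should_wake_high(const struct zone_marks *z)
{
    return z->free_pages < z->high_wmark;
}
```

      A zone hovering a few pages above its low watermark is skipped by the
      first check but caught by the second, which is the safety margin
      mentioned above.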
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      892f795d
    • Libin's avatar
      mm/huge_memory.c: fix potential NULL pointer dereference · a8f531eb
      Libin authored
      In collapse_huge_page() there is a race window between releasing the
      mmap_sem read lock and taking the mmap_sem write lock, so find_vma() may
      return NULL.  So check the return value to avoid NULL pointer dereference.
      
      collapse_huge_page
      	khugepaged_alloc_page
      		up_read(&mm->mmap_sem)
      	down_write(&mm->mmap_sem)
      	vma = find_vma(mm, address)
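
      The pattern behind the fix, shown as a user-space sketch: after the
      lock has been dropped and retaken, the lookup result must be checked
      before use.  struct vma_sketch, find_vma_sketch() and revalidate() are
      hypothetical stand-ins, and the array is assumed to be sorted by
      address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a half-open address range. */
struct vma_sketch {
    unsigned long start;
    unsigned long end;
};

/*
 * Like find_vma(): return the first area whose end lies above addr,
 * or NULL if there is none (for example, because it was unmapped while
 * the lock was not held).
 */
static const struct vma_sketch *
find_vma_sketch(const struct vma_sketch *vmas, size_t nr, unsigned long addr)
{
    for (size_t i = 0; i < nr; i++)
        if (addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

/* Returns 0 if the mapping is still there, -1 if the caller must bail out. */
static int revalidate(const struct vma_sketch *vmas, size_t nr,
                      unsigned long addr)
{
    const struct vma_sketch *vma = find_vma_sketch(vmas, nr, addr);

    if (!vma)       /* the check this patch adds */
        return -1;
    if (addr < vma->start)
        return -1;  /* found a later area; addr itself is unmapped */
    return 0;
}
```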
      Signed-off-by: Libin <huawei.libin@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org> # v3.0+
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8f531eb
    • Yinghai Lu's avatar
      mm: kill one if loop in __free_pages_bootmem() · e2d0bd2b
      Yinghai Lu authored
      We should not check loop+1 against the loop end inside the loop body.  Just
      duplicate two lines of code to avoid it.
      
      That will help a bit when we have a huge number of pages on a system with
      16TiB of memory.
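
      A sketch of the transformation, using hypothetical stand-ins
      (struct page_sketch, prefetch_hint()) rather than the real struct page
      and prefetchw().  The peeled version assumes at least one page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the page fields being initialized. */
struct page_sketch {
    int reserved;
    int count;
};

/* Placeholder for a prefetch hint; does nothing in this sketch. */
static void prefetch_hint(const void *p)
{
    (void)p;
}

/* Before: the "is there a next page?" test runs on every iteration. */
static void init_checked(struct page_sketch *page, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        struct page_sketch *p = &page[i];

        if (i + 1 < nr)
            prefetch_hint(p + 1);
        p->reserved = 0;
        p->count = 0;
    }
}

/* After: the last page is handled by two duplicated lines after the loop. */
static void init_peeled(struct page_sketch *page, size_t nr)
{
    struct page_sketch *p = page;

    assert(nr >= 1);
    prefetch_hint(p);
    for (size_t i = 0; i < nr - 1; i++, p++) {
        prefetch_hint(p + 1);
        p->reserved = 0;
        p->count = 0;
    }
    p->reserved = 0;
    p->count = 0;
}
```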
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2d0bd2b
    • Srivatsa S. Bhat's avatar
      mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint() · f92310c1
      Srivatsa S. Bhat authored
      In the current code, the value of fallback_migratetype that is printed
      using the mm_page_alloc_extfrag tracepoint, is the value of the
      migratetype *after* it has been set to the preferred migratetype (if the
      ownership was changed).  Obviously that wouldn't have been the original
      intent.  (We already have a separate 'change_ownership' field to tell
      whether the ownership of the pageblock was changed from the
      fallback_migratetype to the preferred type.)
      
      The intent of the fallback_migratetype field is to show the migratetype
      from which we borrowed pages in order to satisfy the allocation request.
      So fix the code to print that value correctly.
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f92310c1
    • Srivatsa S. Bhat's avatar
      mm/page_alloc.c: restructure free-page stealing code and fix a bug · fef903ef
      Srivatsa S. Bhat authored
      The free-page stealing code in __rmqueue_fallback() is somewhat hard to
      follow, and has an incredible amount of subtlety hidden inside!
      
      First off, there is a minor bug in the reporting of change-of-ownership of
      pageblocks.  Under some conditions, we try to move up to
      'pageblock_nr_pages' pages to the preferred allocation list.  But we
      change the ownership of that pageblock to the preferred type only if we
      manage to successfully move at least half of that pageblock (or if
      page_group_by_mobility_disabled is set).
      
      However, the current code ignores the latter part and sets the
      'migratetype' variable to the preferred type, irrespective of whether we
      actually changed the pageblock migratetype of that block or not.  So, the
      page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
      'change_ownership' might be shown as 1 when it must have been 0).
      
      So fixing this involves moving the update of the 'migratetype' variable to
      the right place.  But looking closer, we observe that the 'migratetype'
      variable is used subsequently for checks such as "is_migrate_cma()".
      Obviously the intent there is to check if the *fallback* type is
      MIGRATE_CMA, but since we already set the 'migratetype' variable to
      start_migratetype, we end up checking if the *preferred* type is
      MIGRATE_CMA!!
      
      To make things more interesting, this actually doesn't cause a bug in
      practice, because we never change *anything* if the fallback type is CMA.
      
      So, restructure the code in such a way that it is trivial to understand
      what is going on, and also fix the above mentioned bug.  And while at it,
      also add a comment explaining the subtlety behind the migratetype used in
      the call to expand().
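
      The ownership rule and the reporting described above can be sketched as
      follows.  The names claims_pageblock(), steal_report() and
      struct extfrag_event are hypothetical, and the pageblock size of 512
      pages is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed pageblock size for this sketch. */
enum { PAGEBLOCK_NR_PAGES_SKETCH = 512 };

/*
 * Ownership of the pageblock changes only if at least half of it was
 * moved, or if grouping pages by mobility is disabled altogether.
 */
static bool claims_pageblock(int pages_moved, bool mobility_grouping_disabled)
{
    return pages_moved >= PAGEBLOCK_NR_PAGES_SKETCH / 2 ||
           mobility_grouping_disabled;
}

/* Hypothetical subset of what the tracepoint reports. */
struct extfrag_event {
    int fallback_migratetype; /* type the pages were borrowed from */
    int change_ownership;     /* 1 only if the pageblock changed type */
};

static struct extfrag_event steal_report(int fallback_type, int pages_moved,
                                         bool mobility_grouping_disabled)
{
    struct extfrag_event ev;

    /* Keep the fallback type as-is; never overwrite it with the
     * preferred type before reporting. */
    ev.fallback_migratetype = fallback_type;
    ev.change_ownership =
        claims_pageblock(pages_moved, mobility_grouping_disabled);
    return ev;
}
```

      Keeping the fallback type in its own variable is what makes both the
      tracepoint output and a later "is the fallback type CMA?" style check
      refer to the right migratetype.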
      
      [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fef903ef
    • Pintu Kumar's avatar
      mm/page_alloc.c: fix coding style and spelling · b8af2941
      Pintu Kumar authored
      Fix all errors reported by checkpatch and some small spelling mistakes.
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8af2941
    • Shaohua Li's avatar
      swap: make cluster allocation per-cpu · ebc2a1a6
      Shaohua Li authored
      Swap cluster allocation exists to get better request merging and thus
      better performance.  But the cluster is shared globally, so if multiple
      tasks are swapping, their disk accesses interleave.  Multiple tasks
      swapping is quite common: for example, each NUMA node has a kswapd thread
      doing swap, and multiple threads/processes do direct page reclaim.
      
      The IO scheduler can't help much here, because tasks don't send swapout IO
      down to the block layer at the same time.  The block layer does merge some
      IOs, but many are not merged, depending on how many tasks are doing swapout
      concurrently.  In practice, I've seen a lot of small size IO in swapout
      workloads.
      
      We make the cluster allocation per-cpu here.  The interleaved disk access
      issue goes away.  All tasks swap out to their own cluster, so swapout
      becomes sequential and can easily be merged into big IOs.  If one CPU
      can't get its per-cpu cluster (for example, there is no free cluster left
      in the swap), it falls back to scanning swap_map.  The CPU can still
      continue to swap.  We don't need to recycle free swap entries of other
      CPUs.
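
      A minimal sketch of the per-cpu idea.  Everything here is a
      hypothetical simplification: struct swap_sketch, alloc_slot() and the
      plain "next free cluster" counter stand in for the real swap_info
      structures and free cluster list.

```c
#include <assert.h>

#define CLUSTER_PAGES  256
#define NR_CPUS_SKETCH 2
#define NO_CLUSTER     (-1)

/* Each CPU remembers its own cluster and the next slot inside it. */
struct percpu_cluster_sketch {
    int cluster;
    int next_off;
};

struct swap_sketch {
    int next_free_cluster; /* stand-in for the free cluster list */
    int nr_clusters;
    struct percpu_cluster_sketch cpu[NR_CPUS_SKETCH];
};

static void swap_sketch_init(struct swap_sketch *s, int nr_clusters)
{
    s->next_free_cluster = 0;
    s->nr_clusters = nr_clusters;
    for (int i = 0; i < NR_CPUS_SKETCH; i++) {
        s->cpu[i].cluster = NO_CLUSTER;
        s->cpu[i].next_off = 0;
    }
}

/*
 * Hand out the next slot from this CPU's cluster, taking a fresh cluster
 * when the current one is used up.  Returns -1 when no free cluster is
 * left, meaning the caller would fall back to scanning swap_map.
 */
static long alloc_slot(struct swap_sketch *s, int cpu)
{
    struct percpu_cluster_sketch *pc = &s->cpu[cpu];

    if (pc->cluster == NO_CLUSTER || pc->next_off == CLUSTER_PAGES) {
        if (s->next_free_cluster >= s->nr_clusters)
            return -1;
        pc->cluster = s->next_free_cluster++;
        pc->next_off = 0;
    }
    return (long)pc->cluster * CLUSTER_PAGES + pc->next_off++;
}
```

      Two CPUs allocating alternately get slots 0, 256, 1, 257, and so on:
      each CPU's writes stay sequential within its own cluster instead of
      interleaving in a shared one.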
      
      In my test (swap to a 2-disk raid0 partition), this improves around 10%
      swapout throughput, and request size is increased significantly.
      
      How this impacts swap readahead is uncertain, though.  On one side,
      page reclaim always isolates and swaps several adjacent pages, which
      makes page reclaim write the pages sequentially and benefits readahead.
      On the other side, when several CPUs write pages interleaved, the pages
      don't live _sequentially_ but relatively _near_.  In the per-cpu allocation
      case, if adjacent pages are written by different CPUs, they will live
      relatively _far_.  So how this impacts swap readahead depends on how many
      pages page reclaim isolates and swaps at one time.  If the number is big,
      this patch will benefit swap readahead.  Of course, this is about the
      sequential access pattern.  The patch has no impact on the random access
      pattern, because the new cluster allocation algorithm is just for SSD.
      
      An alternative solution is organizing the swap layout to be per-mm instead
      of this per-cpu approach.  In the per-mm layout, we allocate a disk range
      for each mm, so pages of one mm live adjacently on the swap disk.  The
      per-mm layout has potential lock contention issues if multiple reclaimers
      are swapping pages from one mm.  For a sequential workload, the per-mm
      layout is better for swap readahead, because pages from the mm are adjacent
      on disk.  But the per-cpu layout isn't very bad in this workload, as page
      reclaim always isolates and swaps several pages at one time; such pages
      will still live on disk sequentially and readahead can utilize this.  For
      a random workload, the per-mm layout doesn't benefit request merging,
      because it's quite possible that pages from different mms are swapped out
      at the same time and IO can't be merged in the per-mm layout, while with
      the per-cpu layout we can merge requests from any mm.  Considering that
      random workloads are more common among workloads with swap (and the
      per-cpu approach isn't too bad for sequential workloads either), I'm
      choosing the per-cpu layout.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc2a1a6
    • Shaohua Li's avatar
      swap: fix races exposed by swap discard · edfe23da
      Shaohua Li authored
      The previous patch can expose races, according to Hugh:
      
      swapoff was sometimes failing with "Cannot allocate memory", coming from
      try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
      on a free entry temporarily SWAP_MAP_BAD while being discarded.
      
      We should use ACCESS_ONCE() there, and whenever accessing swap_map
      locklessly; but rather than peppering it throughout try_to_unuse(), just
      declare *swap_map with volatile.
      
      try_to_unuse() is accustomed to *swap_map going down racily, but not
      necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
      prevent that transition once SWP_WRITEOK is switched off, when it's a
      waste of time to issue discards anyway (swapon can do a whole discard).
      
      Another issue is:
      
      In swapin_readahead(), read_swap_cache_async() can read a bad swap entry,
      because we don't check if the readahead swap entry is bad.  This doesn't
      break anything, but such a swapin page is wasteful and can only be freed at
      page reclaim.  We should avoid reading such a swap entry.  And in discard,
      we mark the swap entry SWAP_MAP_BAD and then switch it to normal when the
      discard is finished.  If readahead reads such a swap entry, we have the
      same issue, so we must check if the swap entry is bad too.
      
      Thanks to Hugh for the hint that swapin_readahead could use a bad swap
      entry.
      
      [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard']
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edfe23da
    • Shaohua Li's avatar
      swap: make swap discard async · 815c2c54
      Shaohua Li authored
      swap can do cluster discard for SSD, which is good, but there are some
      problems here:
      
      1. swap does the discard just before page reclaim gets a swap entry and
         writes the disk sectors.  This is useless for high end SSD, because an
         overwrite to a sector implies a discard to the original sector too.  A
         discard + overwrite == overwrite.
      
      2. the purpose of doing discard is to improve SSD firmware garbage
         collection.  Ideally we should send discard as early as possible, so
         firmware can do something smart.  Sending discard just after swap entry
         is freed is considered early compared to sending discard before write.
         Of course, if workload is already bound to gc speed, sending discard
         earlier or later doesn't make a difference.
      
      3. block discard is a sync API, which will delay scan_swap_map()
         significantly.
      
      4. Write and discard commands can be executed in parallel on PCIe SSDs.
         Making swap discard async can make execution more efficient.
      
      This patch makes swap discard async and moves discard to where swap entry
      is freed.  Discard and write have no dependence now, so above issues can
      be avoided.  Ideally we should do discard for any freed sectors, but some
      SSD discard is very slow.  This patch still does discard for a whole
      cluster.
      
      My test does several rounds of 'mmap, write, unmap', which will trigger a
      lot of swap discard.  On a fusionio card, with this patch, the test
      runtime is reduced to 18% of the time without it, so around 5.5x faster.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      815c2c54
    • Shaohua Li's avatar
      swap: change block allocation algorithm for SSD · 2a8f9449
      Shaohua Li authored
      I'm using a fast SSD to do swap.  scan_swap_map() sometimes uses up to
      20~30% CPU time (when cluster is hard to find, the CPU time can be up to
      80%), which becomes a bottleneck.  scan_swap_map() scans a byte array to
      search for a 256-page cluster, which is very slow.
      
      Here I introduce a simple algorithm to search for a cluster.  Since we only
      care about 256-page clusters, we can just use a counter to track if a
      cluster is free.  Every 256 pages use one int to store the counter.  If
      the counter of a cluster is 0, the cluster is free.  All free clusters
      are added to a list, so searching for a cluster is very efficient.  With
      this, the scan_swap_map() overhead disappears.
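
      The counter-plus-list idea can be sketched in user-space C like this.
      struct cluster_map and its helpers are hypothetical; the sketch assumes
      pages are only taken from a cluster that was first removed from the
      free list with cluster_alloc().

```c
#include <assert.h>

#define CLUSTER_PAGES 256
#define NR_CLUSTERS   4
#define NIL           (-1)

struct cluster_map {
    int used[NR_CLUSTERS]; /* pages in use per cluster; 0 means free */
    int next[NR_CLUSTERS]; /* free list links */
    int head;              /* first free cluster, or NIL */
    int tail;              /* last free cluster, or NIL */
};

/* Start with every cluster free and linked in index order. */
static void cluster_map_init(struct cluster_map *m)
{
    for (int i = 0; i < NR_CLUSTERS; i++) {
        m->used[i] = 0;
        m->next[i] = (i + 1 < NR_CLUSTERS) ? i + 1 : NIL;
    }
    m->head = 0;
    m->tail = NR_CLUSTERS - 1;
}

/* Freed clusters always go to the list tail. */
static void free_list_add_tail(struct cluster_map *m, int idx)
{
    m->next[idx] = NIL;
    if (m->tail == NIL)
        m->head = idx;
    else
        m->next[m->tail] = idx;
    m->tail = idx;
}

/* Take the first free cluster without scanning; NIL if none is left. */
static int cluster_alloc(struct cluster_map *m)
{
    int idx = m->head;

    if (idx == NIL)
        return NIL;
    m->head = m->next[idx];
    if (m->head == NIL)
        m->tail = NIL;
    m->next[idx] = NIL;
    return idx;
}

/* Account one page as used in its cluster. */
static void page_get(struct cluster_map *m, int page)
{
    m->used[page / CLUSTER_PAGES]++;
}

/* Release one page; a cluster whose counter drops to 0 is free again. */
static void page_put(struct cluster_map *m, int page)
{
    int idx = page / CLUSTER_PAGES;

    if (--m->used[idx] == 0)
        free_list_add_tail(m, idx);
}
```

      Finding a free cluster is a constant-time list operation here, instead
      of a scan over a byte per page.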
      
      This might help low end SD card swap too, because if the cluster is
      aligned, SD firmware can do flash erase more efficiently.
      
      We only enable the algorithm for SSD.  Hard disk swap isn't fast enough
      to benefit, and the algorithm has downsides there which might introduce
      regressions (see below).
      
      The patch slightly changes which cluster is chosen.  It always adds a free
      cluster to the list tail.  This can help wear leveling for low end SSD too.
      And if no cluster is found, scan_swap_map() will search from the end of
      the last free cluster, which is random.  For SSD, this isn't a problem at
      all.
      
      Another downside is the cluster must be aligned to 256 pages, which will
      reduce the chance to find a cluster.  I would expect this isn't a big
      problem for SSD because there is no seek penalty.  (And this is the reason
      I only enable the algorithm for SSD).
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a8f9449