1. 04 Jun, 2014 40 commits
    • Denys Vlasenko's avatar
      Documentation/sysctl/vm.txt: clarify vfs_cache_pressure description · 4a0da71b
      Denys Vlasenko authored
      Existing description is worded in a way which almost encourages setting of
      vfs_cache_pressure above 100, possibly way above it.
      
      Users are left in a dark what this numeric value is - an int?  a
      percentage?  what the scale is?
      
      As a result, we are getting reports about noticeable performance
      degradation from users who have set vfs_cache_pressure to ridiculously
      high values - because they thought there is no downside to it.
      
      Via code inspection it's obvious that this value is treated as a
      percentage.  This patch changes text to reflect this fact, and adds a
      cautionary paragraph advising against setting vfs_cache_pressure sky high.
      Signed-off-by: default avatarDenys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4a0da71b
    • Naoya Horiguchi's avatar
      mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO) · 3ba08129
      Naoya Horiguchi authored
      Currently memory error handler handles action optional errors in the
      deferred manner by default.  And if a recovery aware application wants
      to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
      However, such signal can be sent only to the main thread, so it's
      problematic if the application wants to have a dedicated thread to
      handler such signals.
      
      So this patch adds dedicated thread support to memory error handler.  We
      have PF_MCE_EARLY flags for each thread separately, so with this patch
      AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
      thread.  If you want to implement a dedicated thread, you call prctl()
      to set PF_MCE_EARLY on the thread.
      
      Memory error handler collects processes to be killed, so this patch lets
      it check PF_MCE_EARLY flag on each thread in the collecting routines.
      
      No behavioral change for all non-early kill cases.
      
      Tony said:
      
      : The old behavior was crazy - someone with a multithreaded process might
      : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
      : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
      : that thread wasn't the main thread for the process.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Cc: Kamil Iskra <iskra@mcs.anl.gov>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ba08129
    • Tony Luck's avatar
      mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED · 74614de1
      Tony Luck authored
      When Linux sees an "action optional" machine check (where h/w has reported
      an error that is not in the current execution path) we generally do not
      want to signal a process, since most processes do not have a SIGBUS
      handler - we'd just prematurely terminate the process for a problem that
      they might never actually see.
      
      task_early_kill() decides whether to consider a process - and it checks
      whether this specific process has been marked for early signals with
      "prctl", or if the system administrator has requested early signals for
      all processes using /proc/sys/vm/memory_failure_early_kill.
      
      But for MF_ACTION_REQUIRED case we must not defer.  The error is in the
      execution path of the current thread so we must send the SIGBUS
      immediatley.
      
      Fix by passing a flag argument through collect_procs*() to
      task_early_kill() so it knows whether we can defer or must take action.
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74614de1
    • Tony Luck's avatar
      mm/memory-failure.c-failure: send right signal code to correct thread · a70ffcac
      Tony Luck authored
      When a thread in a multi-threaded application hits a machine check because
      of an uncorrectable error in memory - we want to send the SIGBUS with
      si.si_code = BUS_MCEERR_AR to that thread.  Currently we fail to do that
      if the active thread is not the primary thread in the process.
      collect_procs() just finds primary threads and this test:
      
      	if ((flags & MF_ACTION_REQUIRED) && t == current) {
      
      will see that the thread we found isn't the current thread and so send a
      si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
      thread at this time).
      
      We can fix this by checking whether "current" shares the same mm with the
      process that collect_procs() said owned the page.  If so, we send the
      SIGBUS to current (with code BUS_MCEERR_AR).
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarOtto Bruggeman <otto.g.bruggeman@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a70ffcac
    • Jianyu Zhan's avatar
      mm/page-writeback.c: remove outdated comment · d2f31028
      Jianyu Zhan authored
      There is an orphaned prehistoric comment , which used to be against
      get_dirty_limits(), the dawn of global_dirtyable_memory().
      
      Back then, the implementation of get_dirty_limits() is complicated and
      full of magic numbers, so this comment is necessary.  But we now use the
      clear and neat global_dirtyable_memory(), which renders this comment
      ambiguous and useless.  Remove it.
      Signed-off-by: default avatarJianyu Zhan <nasa4836@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2f31028
    • Chen Yucong's avatar
      mm/swapfile.c: delete the "last_in_cluster < scan_base" loop in the body of scan_swap_map() · 50088c44
      Chen Yucong authored
      Via commit ebc2a1a6 ("swap: make cluster allocation per-cpu"), we
      can find that all SWP_SOLIDSTATE "seek is cheap"(SSD case) has already
      gone to si->cluster_info scan_swap_map_try_ssd_cluster() route.  So that
      the "last_in_cluster < scan_base" loop in the body of scan_swap_map()
      has already become a dead code snippet, and it should have been deleted.
      
      This patch is to delete the redundant loop as Hugh and Shaohua
      suggested.
      
      [hughd@google.com: fix comment, simplify code]
      Signed-off-by: default avatarChen Yucong <slaoub@gmail.com>
      Cc: Shaohua Li <shli@kernel.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      50088c44
    • Naoya Horiguchi's avatar
      hugetlb: rename hugepage_migration_support() to ..._supported() · 100873d7
      Naoya Horiguchi authored
      We already have a function named hugepages_supported(), and the similar
      name hugepage_migration_support() is a bit unconfortable, so let's rename
      it hugepage_migration_supported().
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      100873d7
    • Kirill A. Shutemov's avatar
      mm: document do_fault_around() feature · 1fdb412b
      Kirill A. Shutemov authored
      Some clarification on how faultaround works.
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1fdb412b
    • Kirill A. Shutemov's avatar
      mm: nominate faultaround area in bytes rather than page order · a9b0f861
      Kirill A. Shutemov authored
      There is evidencs that the faultaround feature is less relevant on
      architectures with page size bigger then 4k.  Which makes sense since page
      fault overhead per byte of mapped area should be less there.
      
      Let's rework the feature to specify faultaround area in bytes instead of
      page order.  It's 64 kilobytes for now.
      
      The patch effectively disables faultaround on architectures with page size
      >= 64k (like ppc64).
      
      It's possible that some other size of faultaround area is relevant for a
      platform.  We can expose `fault_around_bytes' variable to arch-specific
      code once such platforms will be found.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9b0f861
    • Zhang Zhen's avatar
      mm/page_alloc.c: cleanup add_active_range() related comments · 7d018176
      Zhang Zhen authored
      add_active_range() has been repalced by memblock_set_node().  Clean up the
      comments to comply with that change.
      Signed-off-by: default avatarZhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d018176
    • Konstantin Khlebnikov's avatar
      mm/rmap.c: cleanup ttu_flags · daa5ba76
      Konstantin Khlebnikov authored
      Transform action part of ttu_flags into individiual bits.  These flags
      aren't part of any uses-space visible api or even trace events.
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      daa5ba76
    • Konstantin Khlebnikov's avatar
      mm/rmap.c: don't call mmu_notifier_invalidate_page() during munlock · 3d92860f
      Konstantin Khlebnikov authored
      In its munmap mode, try_to_unmap_one() searches other mlocked vmas, it
      never unmaps pages.  There is no reason for invalidation because ptes are
      left unchanged.
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d92860f
    • Konstantin Khlebnikov's avatar
      mm/process_vm_access: move config option into init/Kconfig · 226b4ccd
      Konstantin Khlebnikov authored
      CONFIG_CROSS_MEMORY_ATTACH adds couple syscalls: process_vm_readv and
      process_vm_writev, it's a kind of IPC for copying data between processes.
      Currently this option is placed inside "Processor type and features".
      
      This patch moves it into "General setup" (where all other arch-independed
      syscalls and ipc features are placed) and changes prompt string to less
      cryptic.
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      226b4ccd
    • Mel Gorman's avatar
      mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY · 1a501907
      Mel Gorman authored
      Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
      ensured that file/anon lists were scanned proportionally for reclaim from
      kswapd but ignored it for direct reclaim.  The intent was to minimse
      direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
      long stall for many small stalls and distorts aging for normal workloads
      like streaming readers/writers.  Hugh Dickins pointed out that a
      side-effect of the same commit was that when one LRU list dropped to zero
      that the entirety of the other list was shrunk leading to excessive
      reclaim in memcgs.  This patch scans the file/anon lists proportionally
      for direct reclaim to similarly age page whether reclaimed by kswapd or
      direct reclaim but takes care to abort reclaim if one LRU drops to zero
      after reclaiming the requested number of pages.
      
      Based on ext4 and using the Intel VM scalability test
      
                                                    3.15.0-rc5            3.15.0-rc5
                                                      shrinker            proportion
      Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
      Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
      Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
      Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
      Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
      Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)
      
      The test cases are running multiple dd instances reading sparse files. The results are within
      the noise for the small test machine. The impact of the patch is more noticable from the vmstats
      
                                  3.15.0-rc5  3.15.0-rc5
                                    shrinker  proportion
      Minor Faults                     35154       36784
      Major Faults                       611        1305
      Swap Ins                           394        1651
      Swap Outs                         4394        5891
      Allocation stalls               118616       44781
      Direct pages scanned           4935171     4602313
      Kswapd pages scanned          15921292    16258483
      Kswapd pages reclaimed        15913301    16248305
      Direct pages reclaimed         4933368     4601133
      Kswapd efficiency                  99%         99%
      Kswapd velocity             670088.047  682555.961
      Direct efficiency                  99%         99%
      Direct velocity             207709.217  193212.133
      Percentage direct scans            23%         22%
      Page writes by reclaim        4858.000    6232.000
      Page writes file                   464         341
      Page writes anon                  4394        5891
      
      Note that there are fewer allocation stalls even though the amount
      of direct reclaim scanning is very approximately the same.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: default avatarYuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1a501907
    • Tim Chen's avatar
      fs/superblock: avoid locking counting inodes and dentries before reclaiming them · d23da150
      Tim Chen authored
      We remove the call to grab_super_passive in call to super_cache_count.
      This becomes a scalability bottleneck as multiple threads are trying to do
      memory reclamation, e.g.  when we are doing large amount of file read and
      page cache is under pressure.  The cached objects quickly got reclaimed
      down to 0 and we are aborting the cache_scan() reclaim.  But counting
      creates a log jam acquiring the sb_lock.
      
      We are holding the shrinker_rwsem which ensures the safety of call to
      list_lru_count_node() and s_op->nr_cached_objects.  The shrinker is
      unregistered now before ->kill_sb() so the operation is safe when we are
      doing unmount.
      
      The impact will depend heavily on the machine and the workload but for a
      small machine using postmark tuned to use 4xRAM size the results were
      
                                        3.15.0-rc5            3.15.0-rc5
                                           vanilla         shrinker-v1r1
      Ops/sec Transactions         21.00 (  0.00%)       24.00 ( 14.29%)
      Ops/sec FilesCreate          39.00 (  0.00%)       44.00 ( 12.82%)
      Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
      Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
      Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
      Ops/sec DataRead/MB          25.97 (  0.00%)       29.10 ( 12.05%)
      Ops/sec DataWrite/MB         49.99 (  0.00%)       56.02 ( 12.06%)
      
      ffsb running in a configuration that is meant to simulate a mail server showed
      
                                       3.15.0-rc5             3.15.0-rc5
                                          vanilla          shrinker-v1r1
      Ops/sec readall           9402.63 (  0.00%)      9567.97 (  1.76%)
      Ops/sec create            4695.45 (  0.00%)      4735.00 (  0.84%)
      Ops/sec delete             173.72 (  0.00%)       179.83 (  3.52%)
      Ops/sec Transactions     14271.80 (  0.00%)     14482.81 (  1.48%)
      Ops/sec Read                37.00 (  0.00%)        37.60 (  1.62%)
      Ops/sec Write               18.20 (  0.00%)        18.30 (  0.55%)
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: default avatarYuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d23da150
    • Dave Chinner's avatar
      fs/superblock: unregister sb shrinker before ->kill_sb() · 28f2cd4f
      Dave Chinner authored
      This series is aimed at regressions noticed during reclaim activity.  The
      first two patches are shrinker patches that were posted ages ago but never
      merged for reasons that are unclear to me.  I'm posting them again to see
      if there was a reason they were dropped or if they just got lost.  Dave?
      Time?  The last patch adjusts proportional reclaim.  Yuanhan Liu, can you
      retest the vm scalability test cases on a larger machine?  Hugh, does this
      work for you on the memcg test cases?
      
      Based on ext4, I get the following results but unfortunately my larger
      test machines are all unavailable so this is based on a relatively small
      machine.
      
      postmark
                                        3.15.0-rc5            3.15.0-rc5
                                           vanilla       proportion-v1r4
      Ops/sec Transactions         21.00 (  0.00%)       25.00 ( 19.05%)
      Ops/sec FilesCreate          39.00 (  0.00%)       45.00 ( 15.38%)
      Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
      Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
      Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
      Ops/sec DataRead/MB          25.97 (  0.00%)       30.02 ( 15.59%)
      Ops/sec DataWrite/MB         49.99 (  0.00%)       57.78 ( 15.58%)
      
      ffsb (mail server simulator)
                                       3.15.0-rc5             3.15.0-rc5
                                          vanilla        proportion-v1r4
      Ops/sec readall           9402.63 (  0.00%)      9805.74 (  4.29%)
      Ops/sec create            4695.45 (  0.00%)      4781.39 (  1.83%)
      Ops/sec delete             173.72 (  0.00%)       177.23 (  2.02%)
      Ops/sec Transactions     14271.80 (  0.00%)     14764.37 (  3.45%)
      Ops/sec Read                37.00 (  0.00%)        38.50 (  4.05%)
      Ops/sec Write               18.20 (  0.00%)        18.50 (  1.65%)
      
      dd of a large file
                                      3.15.0-rc5            3.15.0-rc5
                                         vanilla       proportion-v1r4
      WallTime DownloadTar       75.00 (  0.00%)       61.00 ( 18.67%)
      WallTime DD               423.00 (  0.00%)      401.00 (  5.20%)
      WallTime Delete             2.00 (  0.00%)        5.00 (-150.00%)
      
      stutter (times mmap latency during large amounts of IO)
      
                                  3.15.0-rc5            3.15.0-rc5
                                     vanilla       proportion-v1r4
      Unit >5ms Delays  80252.0000 (  0.00%)  81523.0000 ( -1.58%)
      Unit Mmap min         8.2118 (  0.00%)      8.3206 ( -1.33%)
      Unit Mmap mean       17.4614 (  0.00%)     17.2868 (  1.00%)
      Unit Mmap stddev     24.9059 (  0.00%)     34.6771 (-39.23%)
      Unit Mmap max      2811.6433 (  0.00%)   2645.1398 (  5.92%)
      Unit Mmap 90%        20.5098 (  0.00%)     18.3105 ( 10.72%)
      Unit Mmap 93%        22.9180 (  0.00%)     20.1751 ( 11.97%)
      Unit Mmap 95%        25.2114 (  0.00%)     22.4988 ( 10.76%)
      Unit Mmap 99%        46.1430 (  0.00%)     43.5952 (  5.52%)
      Unit Ideal  Tput     85.2623 (  0.00%)     78.8906 (  7.47%)
      Unit Tput min        44.0666 (  0.00%)     43.9609 (  0.24%)
      Unit Tput mean       45.5646 (  0.00%)     45.2009 (  0.80%)
      Unit Tput stddev      0.9318 (  0.00%)      1.1084 (-18.95%)
      Unit Tput max        46.7375 (  0.00%)     46.7539 ( -0.04%)
      
      This patch (of 3):
      
      We will like to unregister the sb shrinker before ->kill_sb().  This will
      allow cached objects to be counted without call to grab_super_passive() to
      update ref count on sb.  We want to avoid locking during memory
      reclamation especially when we are skipping the memory reclaim when we are
      out of cached objects.
      
      This is safe because grab_super_passive does a try-lock on the
      sb->s_umount now, and so if we are in the unmount process, it won't ever
      block.  That means what used to be a deadlock and races we were avoiding
      by using grab_super_passive() is now:
      
              shrinker                        umount
      
              down_read(shrinker_rwsem)
                                              down_write(sb->s_umount)
                                              shrinker_unregister
                                                down_write(shrinker_rwsem)
                                                  <blocks>
              grab_super_passive(sb)
                down_read_trylock(sb->s_umount)
                  <fails>
              <shrinker aborts>
              ....
              <shrinkers finish running>
              up_read(shrinker_rwsem)
                                                <unblocks>
                                                <removes shrinker>
                                                up_write(shrinker_rwsem)
                                              ->kill_sb()
                                              ....
      
      So it is safe to deregister the shrinker before ->kill_sb().
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: default avatarYuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28f2cd4f
    • Kirill A. Shutemov's avatar
    • Matthew Wilcox's avatar
      mm/msync.c: sync only the requested range in msync() · 7fc34a62
      Matthew Wilcox authored
      msync() currently syncs more than POSIX requires or BSD or Solaris
      implement.  It is supposed to be equivalent to fdatasync(), not fsync(),
      and it is only supposed to sync the portion of the file that overlaps the
      range passed to msync.
      
      If the VMA is non-linear, fall back to syncing the entire file, but we
      still optimise to only fdatasync() the entire file, not the full fsync().
      
      akpm: there are obvious concerns with bck-compatibility: is anyone relying
      on the undocumented side-effect for their data integrity?  And how would
      they ever know if this change broke their data integrity?
      
      We think the risk is reasonably low, and this patch brings the kernel into
      line with other OS's and with what the manpage has always said...
      Signed-off-by: default avatarMatthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Chris Mason <clm@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7fc34a62
    • Chen Yucong's avatar
      hwpoison: remove unused global variable in do_machine_check() · 65eb7182
      Chen Yucong authored
      Remove an unused global variable mce_entry and relative operations in
      do_machine_check().
      Signed-off-by: default avatarChen Yucong <slaoub@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      65eb7182
    • Vlastimil Babka's avatar
      mm, compaction: properly signal and act upon lock and need_sched() contention · be976572
      Vlastimil Babka authored
      Compaction uses compact_checklock_irqsave() function to periodically check
      for lock contention and need_resched() to either abort async compaction,
      or to free the lock, schedule and retake the lock.  When aborting,
      cc->contended is set to signal the contended state to the caller.  Two
      problems have been identified in this mechanism.
      
      First, compaction also calls directly cond_resched() in both scanners when
      no lock is yet taken.  This call either does not abort async compaction,
      or set cc->contended appropriately.  This patch introduces a new
      compact_should_abort() function to achieve both.  In isolate_freepages(),
      the check frequency is reduced to once by SWAP_CLUSTER_MAX pageblocks to
      match what the migration scanner does in the preliminary page checks.  In
      case a pageblock is found suitable for calling isolate_freepages_block(),
      the checks within there are done on higher frequency.
      
      Second, isolate_freepages() does not check if isolate_freepages_block()
      aborted due to contention, and advances to the next pageblock.  This
      violates the principle of aborting on contention, and might result in
      pageblocks not being scanned completely, since the scanning cursor is
      advanced.  This problem has been noticed in the code by Joonsoo Kim when
      reviewing related patches.  This patch makes isolate_freepages_block()
      check the cc->contended flag and abort.
      
      In case isolate_freepages() has already isolated some pages before
      aborting due to contention, page migration will proceed, which is OK since
      we do not want to waste the work that has been done, and page migration
      has own checks for contention.  However, we do not want another isolation
      attempt by either of the scanners, so cc->contended flag check is added
      also to compaction_alloc() and compact_finished() to make sure compaction
      is aborted right after the migration.
      
      The outcome of the patch should be reduced lock contention by async
      compaction and lower latencies for higher-order allocations where direct
      compaction is involved.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Reported-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Tested-by: default avatarShawn Guo <shawn.guo@linaro.org>
      Tested-by: default avatarKevin Hilman <khilman@linaro.org>
      Tested-by: default avatarStephen Warren <swarren@nvidia.com>
      Tested-by: default avatarFabio Estevam <fabio.estevam@freescale.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be976572
    • Fabian Frederick's avatar
      fs/hugetlbfs/inode.c: remove null test before kfree · 6e6870d4
      Fabian Frederick authored
      Fix checkpatch warning:
      WARNING: kfree(NULL) is safe this check is probably not required
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6e6870d4
    • Fabian Frederick's avatar
      be1d2cf5
    • Fabian Frederick's avatar
      fs/hugetlbfs/inode.c: add static to hugetlbfs_i_mmap_mutex_key · 422b2448
      Fabian Frederick authored
      hugetlbfs_i_mmap_mutex_key is only used in inode.c
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      422b2448
    • Jianyu Zhan's avatar
      mm/vmscan.c: use DIV_ROUND_UP for calculation of zone's balance_gap and correct comments. · 4be89a34
      Jianyu Zhan authored
      Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
      / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value.  It's better to
      use DIV_ROUND_UP macro for neater code and clear meaning.
      
      Besides, the gap value is calculated against the per-zone "managed pages",
      not "present pages".  This patch also corrects the comment and do some
      rephrasing.
      Signed-off-by: default avatarJianyu Zhan <nasa4836@gmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4be89a34
    • Andy Shevchenko's avatar
      b7596fb4
    • Jianyu Zhan's avatar
      mm, hugetlb: move the error handle logic out of normal code path · 8f34af6f
      Jianyu Zhan authored
      alloc_huge_page() now mixes normal code path with error handle logic.
      This patches move out the error handle logic, to make normal code path
      more clean and redue code duplicate.
      Signed-off-by: default avatarJianyu Zhan <nasa4836@gmail.com>
      Acked-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f34af6f
    • Naoya Horiguchi's avatar
      mm/memory-failure.c: move comment · 6edd6cc6
      Naoya Horiguchi authored
      The comment about pages under writeback is far from the relevant code, so
      let's move it to the right place.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6edd6cc6
    • Mel Gorman's avatar
      mm: avoid unnecessary atomic operations during end_page_writeback() · 888cf2db
      Mel Gorman authored
      If a page is marked for immediate reclaim then it is moved to the tail of
      the LRU list.  This occurs when the system is under enough memory pressure
      for pages under writeback to reach the end of the LRU but we test for this
      using atomic operations on every writeback.  This patch uses an optimistic
      non-atomic test first.  It'll miss some pages in rare cases but the
      consequences are not severe enough to warrant such a penalty.
      
      While the function does not dominate profiles during a simple dd test the
      cost of it is reduced.
      
      73048     0.7428  vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
      23740     0.2409  vmlinux-3.15.0-rc5-lessatomic     end_page_writeback
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      888cf2db
    • Mel Gorman's avatar
      mm: page_alloc: calculate classzone_idx once from the zonelist ref · d8846374
      Mel Gorman authored
      There is no need to calculate zone_idx(preferred_zone) multiple times
      or use the pgdat to figure it out.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8846374
    • Mel Gorman's avatar
      mm: non-atomically mark page accessed during page cache allocation where possible · 2457aec6
      Mel Gorman authored
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary which is noticable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticable with fast storage.  The objective of the patch is to initialse
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      called before the page is visible and can be done non-atomically.
      
      The primary APIs of concern in this care are the following and are used
      by most filesystems.
      
      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin
      
      All of them are very similar in detail to the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior such as whether the page should be marked accessed or not.  Then
      old API is preserved but is basically a thin wrapper around this core
      function.
      
      Each of the filesystems are then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced where as previously it might
      have been repromoted.  This is expected to be rare but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may be marking
      pages accessed that previously did not but it makes sense that filesystems
      have consistent behaviour in this regard.
      
      The test case used to evaulate this is a simple dd of a large file done
      multiple times with the file deleted on each iterations.  The size of the
      file is 1/10th physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      do to writback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Tested-by: default avatarPrabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2457aec6
    • Mel Gorman's avatar
      fs: buffer: do not use unnecessary atomic operations when discarding buffers · e7470ee8
      Mel Gorman authored
      Discarding buffers uses a bunch of atomic operations when discarding
      buffers because ......  I can't think of a reason.  Use a cmpxchg loop to
      clear all the necessary flags.  In most (all?) cases this will be a single
      atomic operations.
      
      [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7470ee8
    • Mel Gorman's avatar
      mm: do not use unnecessary atomic operations when adding pages to the LRU · 6fb81a17
      Mel Gorman authored
      When adding pages to the LRU we clear the active bit unconditionally.
      As the page could be reachable from other paths we cannot use unlocked
      operations without risk of corruption such as a parallel
      mark_page_accessed.  This patch tests if is necessary to clear the
      active flag before using an atomic operation.  This potentially opens a
      tiny race when PageActive is checked as mark_page_accessed could be
      called after PageActive was checked.  The race already exists but this
      patch changes it slightly.  The consequence is that that the page may be
      promoted to the active list that might have been left on the inactive
      list before the patch.  It's too tiny a race and too marginal a
      consequence to always use atomic operations for.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6fb81a17
    • Mel Gorman's avatar
      mm: do not use atomic operations when releasing pages · e3741b50
      Mel Gorman authored
      There should be no references to it any more and a parallel mark should
      not be reordered against us.  Use non-locked varient to clear page active.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3741b50
    • Mel Gorman's avatar
      mm: shmem: avoid atomic operation during shmem_getpage_gfp · 07a42788
      Mel Gorman authored
      shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
      before it's even added to the LRU or visible.  This is unnecessary as what
      could it possible race against?  Use an unlocked variant.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      07a42788
    • Mel Gorman's avatar
      mm: page_alloc: convert hot/cold parameter and immediate callers to bool · b745bc85
      Mel Gorman authored
      cold is a bool, make it one.  Make the likely case the "if" part of the
      block instead of the else as according to the optimisation manual this is
      preferred.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b745bc85
    • Mel Gorman's avatar
      mm: page_alloc: use unsigned int for order in more places · 7aeb09f9
      Mel Gorman authored
      X86 prefers the use of unsigned types for iterators and there is a
      tendency to mix whether a signed or unsigned type if used for page order.
      This converts a number of sites in mm/page_alloc.c to use unsigned int for
      order where possible.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7aeb09f9
    • Mel Gorman's avatar
      mm: page_alloc: lookup pageblock migratetype with IRQs enabled during free · cfc47a28
      Mel Gorman authored
      get_pageblock_migratetype() is called during free with IRQs disabled.
      This is unnecessary and disables IRQs for longer than necessary.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cfc47a28
    • Mel Gorman's avatar
      mm: page_alloc: reduce number of times page_to_pfn is called · dc4b0caf
      Mel Gorman authored
      In the free path we calculate page_to_pfn multiple times. Reduce that.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc4b0caf
    • Mel Gorman's avatar
      mm: page_alloc: use word-based accesses for get/set pageblock bitmaps · e58469ba
      Mel Gorman authored
      The test_bit operations in get/set pageblock flags are expensive.  This
      patch reads the bitmap on a word basis and use shifts and masks to isolate
      the bits of interest.  Similarly masks are used to set a local copy of the
      bitmap and then use cmpxchg to update the bitmap if there have been no
      other changes made in parallel.
      
      In a test running dd onto tmpfs the overhead of the pageblock-related
      functions went from 1.27% in profiles to 0.5%.
      
      In addition to the performance benefits, this patch closes races that are
      possible between:
      
      a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
         reads part of the bits before and other part of the bits after
         set_pageblock_migratetype() has updated them.
      
      b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
         read-modify-update set bit operation in set_pageblock_skip() will cause
         lost updates to some bits changed in the set_pageblock_migratetype().
      
      Joonsoo Kim first reported the case a) via code inspection.  Vlastimil
      Babka's testing with a debug patch showed that either a) or b) occurs
      roughly once per mmtests' stress-highalloc benchmark (although not
      necessarily in the same pageblock).  Furthermore during development of
      unrelated compaction patches, it was observed that frequent calls to
      {start,undo}_isolate_page_range() the race occurs several thousands of
      times and has resulted in NULL pointer dereferences in move_freepages()
      and free_one_page() in places where free_list[migratetype] is
      manipulated by e.g.  list_move().  Further debugging confirmed that
      migratetype had invalid value of 6, causing out of bounds access to the
      free_list array.
      
      That confirmed that the race exist, although it may be extremely rare,
      and currently only fatal where page isolation is performed due to
      memory hot remove.  Races on pageblocks being updated by
      set_pageblock_migratetype(), where both old and new migratetype are
      lower MIGRATE_RESERVE, currently cannot result in an invalid value
      being observed, although theoretically they may still lead to
      unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
      Furthermore, things could get suddenly worse when memory isolation is
      used more, or when new migratetypes are added.
      
      After this patch, the race has no longer been observed in testing.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-and-tested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e58469ba
    • Mel Gorman's avatar
      mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path · 5dab2911
      Mel Gorman authored
      ALLOC_NO_WATERMARK is set in a few cases.  Always by kswapd, always for
      __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc.  Each of these
      cases are relatively rare events but the ALLOC_NO_WATERMARK check is an
      unlikely branch in the fast path.  This patch moves the check out of the
      fast path and after it has been determined that the watermarks have not
      been met.  This helps the common fast path at the cost of making the slow
      path slower and hitting kswapd with a performance cost.  It's a reasonable
      tradeoff.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5dab2911