1. 14 Jan, 2011 40 commits
    • Mel Gorman's avatar
      mm: kswapd: treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() · 355b09c4
      Mel Gorman authored
      After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
      be balanced but sleeping_prematurely does not.  This can force kswapd to
      stay awake longer than it should.  This patch fixes it.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      355b09c4
    • Mel Gorman's avatar
      mm: kswapd: reset kswapd_max_order and classzone_idx after reading · 4d40502e
      Mel Gorman authored
      When kswapd wakes up, it reads its order and classzone from pgdat and
      calls balance_pgdat.  While its awake, it potentially reclaimes at a high
      order and a low classzone index.  This might have been a once-off that was
      not required by subsequent callers.  However, because the pgdat values
      were not reset, they remain artifically high while balance_pgdat() is
      running and potentially kswapd enters a second unnecessary reclaim cycle.
      Reset the pgdat order and classzone index after reading.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d40502e
    • Mel Gorman's avatar
      mm: kswapd: use the order that kswapd was reclaiming at for sleeping_prematurely() · 0abdee2b
      Mel Gorman authored
      Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
      there was a race pushing a zone below its watermark.  If the race
      happened, it stays awake.  However, balance_pgdat() can decide to reclaim
      at order-0 if it decides that high-order reclaim is not working as
      expected.  This information is not passed back to sleeping_prematurely().
      The impact is that kswapd remains awake reclaiming pages long after it
      should have gone to sleep.  This patch passes the adjusted order to
      sleeping_prematurely and uses the same logic as balance_pgdat to decide if
      it's ok to go to sleep.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0abdee2b
    • Mel Gorman's avatar
      mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced · 1741c877
      Mel Gorman authored
      When reclaiming for high-orders, kswapd is responsible for balancing a
      node but it should not reclaim excessively.  It avoids excessive reclaim
      by considering if any zone in a node is balanced then the node is
      balanced.  In the cases where there are imbalanced zone sizes (e.g.
      ZONE_DMA with both ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep
      prematurely as just one small zone was balanced.
      
      This alters the sleep logic of kswapd slightly.  It counts the number of
      pages that make up the balanced zones.  If the total number of balanced
      pages is more than a quarter of the zone, kswapd will go back to sleep.
      This should keep a node balanced without reclaiming an excessive number of
      pages.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1741c877
    • Mel Gorman's avatar
      mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Mel Gorman authored
      Simon Kirby reported the following problem
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevaling
      theory became that kswapd is reclaiming order-3 pages requested by SLUB
      too aggressive about it.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artifically awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  One, the old figures were forcing
      my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
       Second, I previous specified slub_min_order=3 again in an attempt to
      reproduce Simon's problem.  In this posting, I'm depending on Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact to the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
      					 traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      Tgis patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99504748
    • Steven Rostedt's avatar
      mm: remove likely() from grab_cache_page_write_begin() · c585a267
      Steven Rostedt authored
      Running the annotated branch profiler on a box doing average work
      (firefox, evolution, xchat, distcc farm), the likely() used in
      grab_cache_page_write_begin() was incorrect most of the time:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
       1924262 71332401  97 grab_cache_page_write_begin    filemap.c           2206
      
      Adding a trace_printk() and running the function tracer limited to
      just this function I can see:
      
              gconfd-2-2696  [000]  4467.268935: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=7
              gconfd-2-2696  [000]  4467.268946: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268947: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=8
              gconfd-2-2696  [000]  4467.268959: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268960: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=9
              gconfd-2-2696  [000]  4467.268972: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268973: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=10
              gconfd-2-2696  [000]  4467.268991: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268992: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=11
              gconfd-2-2696  [000]  4467.269005: grab_cache_page_write_begin <-ext3_write_begin
      
      Which shows that a lot of calls from ext3_write_begin will result in the
      page returned by "find_lock_page" will be NULL.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarNick Piggin <npiggin@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c585a267
    • Steven Rostedt's avatar
      mm: remove unlikely() from page_mapping() · e20e8779
      Steven Rostedt authored
      page_mapping() has a unlikely that the mapping has PAGE_MAPPING_ANON set.
      But running the annotated branch profiler on a normal desktop system doing
      vairous tasks (xchat, evolution, firefox, distcc), it is not really that
      unlikely that the mapping here will have the PAGE_MAPPING_ANON flag set:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
      35935762 1270265395  97 page_mapping                   mm.h                 659
      1306198001   143659   0 page_mapping                   mm.h                 657
      203131478   121586   0 page_mapping                   mm.h                 657
       5415491     1116   0 page_mapping                   mm.h                 657
      74899487     1116   0 page_mapping                   mm.h                 657
      203132845      224   0 page_mapping                   mm.h                 659
       5415464       27   0 page_mapping                   mm.h                 659
         13552        0   0 page_mapping                   mm.h                 657
         13552        0   0 page_mapping                   mm.h                 659
        242630        0   0 page_mapping                   mm.h                 657
        242630        0   0 page_mapping                   mm.h                 659
      74899487        0   0 page_mapping                   mm.h                 659
      
      The page_mapping() is a static inline, which is why it shows up multiple
      times.
      
      The unlikely in page_mapping() was correct a total of 1909540379 times and
      incorrect 1270533123 times, with a 39% being incorrect.  With this much of
      an error, it's best to simply remove the unlikely and have the compiler
      and branch prediction figure this out.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e20e8779
    • Steven Rostedt's avatar
      mm: remove likely() from mapping_unevictable() · 088e5465
      Steven Rostedt authored
      The mapping_unevictable() has a likely() around the mapping parameter.
      This mapping parameter comes from page_mapping() which has an unlikely()
      that the page will be set as PAGE_MAPPING_ANON, and if so, it will return
      NULL.  One would think that this unlikely() means that the mapping
      returned by page_mapping() would not be NULL, but where page_mapping() is
      used just above mapping_unevictable(), that unlikely() is incorrect most
      of the time.  This means that the "likely(mapping)" in
      mapping_unevictable() is incorrect most of the time.
      
      Running the annotated branch profiler on my main box which runs firefox,
      evolution, xchat and is part of my distcc farm, I had this:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
      12872836 1269443893  98 mapping_unevictable            pagemap.h            51
      35935762 1270265395  97 page_mapping                   mm.h                 659
      1306198001   143659   0 page_mapping                   mm.h                 657
      203131478   121586   0 page_mapping                   mm.h                 657
       5415491     1116   0 page_mapping                   mm.h                 657
      74899487     1116   0 page_mapping                   mm.h                 657
      203132845      224   0 page_mapping                   mm.h                 659
       5415464       27   0 page_mapping                   mm.h                 659
         13552        0   0 page_mapping                   mm.h                 657
         13552        0   0 page_mapping                   mm.h                 659
        242630        0   0 page_mapping                   mm.h                 657
        242630        0   0 page_mapping                   mm.h                 659
      74899487        0   0 page_mapping                   mm.h                 659
      
      The page_mapping() is a static inline, which is why it shows up multiple
      times.  The mapping_unevictable() is also a static inline but seems to be
      used only once in my setup.
      
      The unlikely in page_mapping() was correct a total of 1909540379 times and
      incorrect 1270533123 times, with a 39% being incorrect.  Perhaps this is
      enough to remove the unlikely from page_mapping() as well.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarNick Piggin <npiggin@kernel.dk>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      088e5465
    • Tobias Klauser's avatar
      vmalloc: remove redundant unlikely() · ddf9c6d4
      Tobias Klauser authored
      IS_ERR() already implies unlikely(), so it can be omitted here.
      Signed-off-by: default avatarTobias Klauser <tklauser@distanz.ch>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ddf9c6d4
    • KOSAKI Motohiro's avatar
      mempolicy: remove tasklist_lock from migrate_pages · 1e50df39
      KOSAKI Motohiro authored
      Today, tasklist_lock in migrate_pages doesn't protect anything.
      rcu_read_lock() provide enough protection from pid hash walk.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e50df39
    • Michel Lespinasse's avatar
      mlock: do not hold mmap_sem for extended periods of time · 53a7706d
      Michel Lespinasse authored
      __get_user_pages gets a new 'nonblocking' parameter to signal that the
      caller is prepared to re-acquire mmap_sem and retry the operation if
      needed.  This is used to split off long operations if they are going to
      block on a disk transfer, or when we detect contention on the mmap_sem.
      
      [akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53a7706d
    • Michel Lespinasse's avatar
      mm: move VM_LOCKED check to __mlock_vma_pages_range() · 5fdb2002
      Michel Lespinasse authored
      Use a single code path for faulting in pages during mlock.
      
      The reason to have it in this patch series is that I did not want to
      update both code paths in a later change that releases mmap_sem when
      blocking on disk.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5fdb2002
    • Michel Lespinasse's avatar
      mm: add FOLL_MLOCK follow_page flag. · 110d74a9
      Michel Lespinasse authored
      Move the code to mlock pages from __mlock_vma_pages_range() to
      follow_page().
      
      This allows __mlock_vma_pages_range() to not have to break down work into
      16-page batches.
      
      An additional motivation for doing this within the present patch series is
      that it'll make it easier for a later chagne to drop mmap_sem when
      blocking on disk (we'd like to be able to resume at the page that was read
      from disk instead of at the start of a 16-page batch).
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      110d74a9
    • Michel Lespinasse's avatar
      mlock: only hold mmap_sem in shared mode when faulting in pages · fed067da
      Michel Lespinasse authored
      Currently mlock() holds mmap_sem in exclusive mode while the pages get
      faulted in.  In the case of a large mlock, this can potentially take a
      very long time, during which various commands such as 'ps auxw' will
      block.  This makes sysadmins unhappy:
      
      real    14m36.232s
      user    0m0.003s
      sys     0m0.015s
      (output from 'time ps auxw' while a 20GB file was being mlocked without
      being previously preloaded into page cache)
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages, in small batches, releasing mmap_sem when we
      block on disk access or when we detect some contention.
      
      This patch:
      
      Before this change, mlock() holds mmap_sem in exclusive mode while the
      pages get faulted in.  In the case of a large mlock, this can potentially
      take a very long time.  Various things will block while mmap_sem is held,
      including 'ps auxw'.  This can make sysadmins angry.
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages with mmap_sem held for reads only.  We need to
      recheck the vma flags after we re-acquire mmap_sem, but this is easy.
      
      In the case where a vma has been munlocked before mlock completes, pages
      that were already marked as PageMlocked() are handled by the munlock()
      call, and mlock() is careful to not mark new page batches as PageMlocked()
      after the munlock() call has cleared the VM_LOCKED vma flags.  So, the end
      result will be identical to what'd happen if munlock() had executed after
      the mlock() call.
      
      In a later change, I will allow the second pass to release mmap_sem when
      blocking on disk accesses or when it is otherwise contended, so that it
      won't be held for long periods of time even in shared mode.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Tested-by: default avatarValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fed067da
    • Michel Lespinasse's avatar
      mlock: avoid dirtying pages and triggering writeback · 5ecfda04
      Michel Lespinasse authored
      When faulting in pages for mlock(), we want to break COW for anonymous or
      file pages within VM_WRITABLE, non-VM_SHARED vmas.  However, there is no
      need to write-fault into VM_SHARED vmas since shared file pages can be
      mlocked first and dirtied later, when/if they actually get written to.
      Skipping the write fault is desirable, as we don't want to unnecessarily
      cause these pages to be dirtied and queued for writeback.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ecfda04
    • Michel Lespinasse's avatar
      do_wp_page: clarify dirty_page handling · 72ddc8f7
      Michel Lespinasse authored
      Reorganize the code so that dirty pages are handled closer to the place
      that makes them dirty (handling write fault into shared, writable VMAs).
      No behavior changes.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72ddc8f7
    • Michel Lespinasse's avatar
      do_wp_page: remove the 'reuse' flag · b009c024
      Michel Lespinasse authored
      mlocking a shared, writable vma currently causes the corresponding pages
      to be marked as dirty and queued for writeback.  This seems rather
      unnecessary given that the pages are not being actually modified during
      mlock.  It is understood that for non-shared mappings (file or anon) we
      want to use a write fault in order to break COW, but there is just no such
      need for shared mappings.
      
      The first two patches in this series do not introduce any behavior change.
       The intent there is to make it obvious that dirtying file pages is only
      done in the (writable, shared) case.  I think this clarifies the code, but
      I wouldn't mind dropping these two patches if there is no consensus about
      them.
      
      The last patch is where we actually avoid dirtying shared mappings during
      mlock.  Note that as a side effect of this, we won't call page_mkwrite()
      for the mappings that define it, and won't be pre-allocating data blocks
      at the FS level if the mapped file was sparsely allocated.  My
      understanding is that mlock does not need to provide such guarantee, as
      evidenced by the fact that it never did for the filesystems that don't
      define page_mkwrite() - including some common ones like ext3.  However, I
      would like to gather feedback on this from filesystem people as a
      precaution.  If this turns out to be a showstopper, maybe block
      preallocation can be added back on using a different interface.
      
      Large shared mlocks are getting significantly (>2x) faster in my tests, as
      the disk can be fully used for reading the file instead of having to share
      between this and writeback.
      
      This patch:
      
      Reorganize the code to remove the 'reuse' flag.  No behavior changes.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b009c024
    • Rik van Riel's avatar
      mm: clear PageError bit in msync & fsync · 212260aa
      Rik van Riel authored
      Temporary IO failures, eg.  due to loss of both multipath paths, can
      permanently leave the PageError bit set on a page, resulting in msync or
      fsync returning -EIO over and over again, even if IO is now getting to the
      disk correctly.
      
      We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the
      filemap_fdatawait_range function.  Also clearing the PageError bit on the
      page allows subsequent msync or fsync calls on this file to return without
      an error, if the subsequent IO succeeds.
      
      Unfortunately data written out in the msync or fsync call that returned
      -EIO can still get lost, because the page dirty bit appears to not get
      restored on IO error.  However, the alternative could be potentially all
      of memory filling up with uncleanable dirty pages, hanging the system, so
      there is no nice choice here...
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarValerie Aurora <vaurora@redhat.com>
      Acked-by: default avatarJeff Layton <jlayton@redhat.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      212260aa
    • Mandeep Singh Baines's avatar
      oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down · dabb16f6
      Mandeep Singh Baines authored
      We'd like to be able to oom_score_adj a process up/down as it
      enters/leaves the foreground.  Currently, it is not possible to oom_adj
      down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
      oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
      or its inherited value at fork.  Assuming the thread that has forked it
      has oom_score_adj of 0, each process could decrease it back from 0 upon
      activation unless a CAP_SYS_RESOURCE thread elevated it to something
      higher.
      
      Alternative considered:
      
      * a setuid binary
      * a daemon with CAP_SYS_RESOURCE
      
      Since you don't wan't all processes to be able to reduce their oom_adj, a
      setuid or daemon implementation would be complex.  The alternatives also
      have much higher overhead.
      
      This patch updated from original patch based on feedback from David
      Rientjes.
      Signed-off-by: default avatarMandeep Singh Baines <msb@chromium.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dabb16f6
    • David Rientjes's avatar
      mm: unify module_alloc code for vmalloc · d0a21265
      David Rientjes authored
      Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
      module_init().  Much of the code is duplicated and can be generalized in a
      globally accessible function, __vmalloc_node_range().
      
      __vmalloc_node() now calls into __vmalloc_node_range() with a range of
      [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.
      
      Each architecture may then use __vmalloc_node_range() directly to remove
      the duplication of code.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0a21265
    • David Rientjes's avatar
      mm: remove gfp mask from pcpu_get_vm_areas · ec3f64fc
      David Rientjes authored
      pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
      formal and use the mask internally.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec3f64fc
    • David Rientjes's avatar
      mm: remove unused get_vm_area_node · e5a5623b
      David Rientjes authored
      get_vm_area_node() is unused in the kernel and can thus be removed.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e5a5623b
    • Mel Gorman's avatar
      mm: vmscan: rename lumpy_mode to reclaim_mode · f3a310bc
      Mel Gorman authored
      With compaction being used instead of lumpy reclaim, the name lumpy_mode
      and associated variables is a bit misleading.  Rename lumpy_mode to
      reclaim_mode which is a better fit.  There is no functional change.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3a310bc
    • Mel Gorman's avatar
      mm: compaction: perform a faster migration scan when migrating asynchronously · 9927af74
      Mel Gorman authored
      try_to_compact_pages() is initially called to only migrate pages
      asychronously and kswapd always compacts asynchronously.  Both are being
      optimistic so it is important to complete the work as quickly as possible
      to minimise stalls.
      
      This patch alters the scanner when asynchronous to only consider
      MIGRATE_MOVABLE pageblocks as migration candidates.  This reduces stalls
      when allocating huge pages while not impairing allocation success rates as
      a full scan will be performed if necessary after direct reclaim.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9927af74
    • Mel Gorman's avatar
      mm: migration: cleanup migrate_pages API by matching types for offlining and sync · 7f0f2496
      Mel Gorman authored
      With the introduction of the boolean sync parameter, the API looks a
      little inconsistent as offlining is still an int.  Convert offlining to a
      bool for the sake of being tidy.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f0f2496
    • Mel Gorman's avatar
      mm: migration: allow migration to operate asynchronously and avoid synchronous... · 77f1fe6b
      Mel Gorman authored
      mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path
      
      Migration synchronously waits for writeback if the initial passes fails.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77f1fe6b
    • Mel Gorman's avatar
      mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Mel Gorman authored
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim whe compaction is available called reclaim/compaction.  The basic
      operation is very simple - instead of selecting a contiguous range of
      pages to reclaim, a number of order-0 pages are reclaimed and then
      compaction is later by either kswapd (compact_zone_order()) or direct
      compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e7d3449
    • Mel Gorman's avatar
      mm: vmscan: convert lumpy_mode into a bitmask · ee64fc93
      Mel Gorman authored
      Currently lumpy_mode is an enum and determines if lumpy reclaim is off,
      syncronous or asyncronous.  In preparation for using compaction instead of
      lumpy reclaim, this patch converts the flags into a bitmap.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee64fc93
    • Mel Gorman's avatar
      mm: compaction: add trace events for memory compaction activity · b7aba698
      Mel Gorman authored
      In preparation for a patches promoting the use of memory compaction over
      lumpy reclaim, this patch adds trace points for memory compaction
      activity.  Using them, we can monitor the scanning activity of the
      migration and free page scanners as well as the number and success rates
      of pages passed to page migration.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b7aba698
    • Nikanth Karthikesan's avatar
      mm: smaps: export mlock information · 2d90508f
      Nikanth Karthikesan authored
      Currently there is no way to find whether a process has locked its pages
      in memory or not.  And which of the memory regions are locked in memory.
      
      Add a new field "Locked" to export this information via the smaps file.
      Signed-off-by: default avatarNikanth Karthikesan <knikanth@suse.de>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d90508f
    • Joe Perches's avatar
      mm: convert sprintf_symbol to %pS · 62c70bce
      Joe Perches authored
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Cc: Jiri Kosina <trivial@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62c70bce
    • Hai Shan's avatar
      fs/mpage.c: consolidate code · c32b0d4b
      Hai Shan authored
      Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
      eliminate code duplication.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarHai Shan <shan.hai@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c32b0d4b
    • Nick Piggin's avatar
      mm: find_get_pages_contig fixlet · 9cbb4cb2
      Nick Piggin authored
      Testing ->mapping and ->index without a ref is not stable as the page
      may have been reused at this point.
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9cbb4cb2
    • KOSAKI Motohiro's avatar
      vmscan: factor out kswapd sleeping logic from kswapd() · f0bc0a60
      KOSAKI Motohiro authored
      Currently, kswapd() has deep nesting and is slightly hard to read.  Clean
      this up.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0bc0a60
    • Bob Liu's avatar
      mm/page-writeback.c: fix __set_page_dirty_no_writeback() return value · c3f0da63
      Bob Liu authored
      __set_page_dirty_no_writeback() should return true if it actually
      transitioned the page from a clean to dirty state although it seems nobody
      uses its return value at present.
      Signed-off-by: default avatarBob Liu <lliubbo@gmail.com>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3f0da63
    • Andrew Morton's avatar
      sync_inode_metadata: fix comment · c691b9d9
      Andrew Morton authored
      Use correct function name, remove incorrect apostrophe
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c691b9d9
    • Jan Kara's avatar
      writeback: avoid livelocking WB_SYNC_ALL writeback · b9543dac
      Jan Kara authored
      When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
      usually set to LONG_MAX.  The logic in wb_writeback() then calls
      __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
      easily end up with non-positive nr_to_write after the function returns, if
      the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.
      
      When nr_to_write is <= 0 wb_writeback() decides we need another round of
      writeback but this is wrong in some cases!  For example when a single
      large file is continuously dirtied, we would never finish syncing it
      because each pass would be able to write MAX_WRITEBACK_PAGES and inode
      dirty timestamp never gets updated (as inode is never completely clean).
      Thus __writeback_inodes_sb() would write the redirtied inode again and
      again.
      
      Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode.  We
      do not need nr_to_write in WB_SYNC_ALL mode anyway since
      write_cache_pages() does livelock avoidance using page tagging in
      WB_SYNC_ALL mode.
      
      This makes wb_writeback() call __writeback_inodes_sb() only once on
      WB_SYNC_ALL.  The latter function won't livelock because it works on
      
      - a finite set of files by doing queue_io() once at the beginning
      - a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging
      
      After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
      longer able to stall sync forever.
      
      [fengguang.wu@intel.com: fix locking comment]
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9543dac
    • Jan Kara's avatar
      writeback: stop background/kupdate works from livelocking other works · aa373cf5
      Jan Kara authored
      Background writeback is easily livelockable in a loop in wb_writeback() by
      a process continuously re-dirtying pages (or continuously appending to a
      file).  This is in fact intended as the target of background writeback is
      to write dirty pages it can find as long as we are over
      dirty_background_threshold.
      
      But the above behavior gets inconvenient at times because no other work
      queued in the flusher thread's queue gets processed.  In particular, since
      e.g.  sync(1) relies on flusher thread to do all the IO for it, sync(1)
      can hang forever waiting for flusher thread to do the work.
      
      Generally, when a flusher thread has some work queued, someone submitted
      the work to achieve a goal more specific than what background writeback
      does.  Moreover by working on the specific work, we also reduce amount of
      dirty pages which is exactly the target of background writeout.  So it
      makes sense to give specific work a priority over a generic page cleaning.
      
      Thus we interrupt background writeback if there is some other work to do.
      We return to the background writeback after completing all the queued
      work.
      
      This may delay the writeback of expired inodes for a while, however the
      expired inodes will eventually be flushed to disk as long as the other
      works won't livelock.
      
      [fengguang.wu@intel.com: update comment]
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aa373cf5
    • Wu Fengguang's avatar
      writeback: trace wakeup event for background writeback · 71927e84
      Wu Fengguang authored
      This tracks when balance_dirty_pages() tries to wakeup the flusher thread
      for background writeback (if it was not started already).
      Suggested-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71927e84
    • Jan Kara's avatar
      writeback: integrated background writeback work · 6585027a
      Jan Kara authored
      Check whether background writeback is needed after finishing each work.
      
      When bdi flusher thread finishes doing some work check whether any kind of
      background writeback needs to be done (either because
      dirty_background_ratio is exceeded or because we need to start flushing
      old inodes).  If so, just do background write back.
      
      This way, bdi_start_background_writeback() just needs to wake up the
      flusher thread.  It will do background writeback as soon as there is no
      other work.
      
      This is a preparatory patch for the next patch which stops background
      writeback as soon as there is other work to do.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6585027a