1. 26 Oct, 2010 40 commits
    • KOSAKI Motohiro's avatar
      vmscan: synchronous lumpy reclaim should not call congestion_wait() · bc57e00f
      KOSAKI Motohiro authored
      congestion_wait() means "wait until queue congestion is cleared".
      However, synchronous lumpy reclaim does not need this congestion_wait() as
      shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback() and it
      provides the necessary waiting.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc57e00f
    • Mel Gorman's avatar
      writeback: account for time spent congestion_waited · 52bb9198
      Mel Gorman authored
      There is strong evidence to indicate a lot of time is being spent in
      congestion_wait(), some of it unnecessarily.  This patch adds a tracepoint
      for congestion_wait to record when congestion_wait() was called, how long
      the timeout was for and how long it actually slept.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52bb9198
    • Mel Gorman's avatar
      tracing, vmscan: add trace events for LRU list shrinking · e11da5b4
      Mel Gorman authored
      There have been numerous reports of stalls that pointed at the problem
      being somewhere in the VM.  There are multiple roots to the problems which
      means dealing with any of the root problems in isolation is tricky to
      justify on their own and they would still need integration testing.  This
      patch series puts together two different patch sets which in combination
      should tackle some of the root causes of latency problems being reported.
      
      Patch 1 adds a tracepoint for shrink_inactive_list.  For this series, the
      most important results is being able to calculate the scanning/reclaim
      ratio as a measure of the amount of work being done by page reclaim.
      
      Patch 2 accounts for time spent in congestion_wait.
      
      Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
      this series.  It has been noted that lumpy reclaim is far too aggressive
      and trashes the system somewhat.  As SLUB uses high-order allocations, a
      large cost incurred by lumpy reclaim will be noticeable.  It was also
      reported during transparent hugepage support testing that lumpy reclaim
      was trashing the system and these patches should mitigate that problem
      without disabling lumpy reclaim.
      
      Patch 7 adds wait_iff_congested() and replaces some callers of
      congestion_wait().  wait_iff_congested() only sleeps if there is a BDI
      that is currently congested.  Patch 8 notes that any BDI being congested
      is not necessarily a problem because there could be multiple BDIs of
      varying speeds and numberous zones.  It attempts to track when a zone
      being reclaimed contains many pages backed by a congested BDI and if so,
      reclaimers wait on the congestion queue.
      
      I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
      machine had 3G of RAM and the CPUs were
      
      X86:    Intel P4 2-core
      X86-64: AMD Phenom 4-core
      PPC64:  PPC970MP
      
      Each used a single disk and the onboard IO controller.  Dirty ratio was
      left at 20.  I'm just going to report for X86-64 and PPC64 in a vague
      attempt to keep this report short.  Four kernels were tested each based on
      v2.6.36-rc4
      
      traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
      lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
      waitcongest-v2r3:   Patches 1-7 to only wait on congestion
      waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
      
      nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
      nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO
      
      The tests run were as follows
      
      kernbench
      	compile-based benchmark. Smoke test performance
      
      sysbench
      	OLTP read-only benchmark. Will be re-run in the future as read-write
      
      micro-mapped-file-stream
      	This is a micro-benchmark from Johannes Weiner that accesses a
      	large sparse-file through mmap(). It was configured to run in only
      	single-CPU mode but can be indicative of how well page reclaim
      	identifies suitable pages.
      
      stress-highalloc
      	Tries to allocate huge pages under heavy load.
      
      kernbench, iozone and sysbench did not report any performance regression
      on any machine.  sysbench did pressure the system lightly and there was
      reclaim activity but there were no difference of major interest between
      the kernels.
      
      X86-64 micro-mapped-file-stream
      
                                            traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
      pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
      pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
      pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
      pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
      pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
      pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
      pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
      pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
      pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
      allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
      
      These are based on the raw figures taken from /proc/vmstat.  It's a rough
      measure of reclaim activity.  Note that allocstall counts are higher
      because we are entering direct reclaim more often as a result of not
      sleeping in congestion.  In itself, it's not necessarily a bad thing.
      It's easier to get a view of what happened from the vmscan tracepoint
      report.
      
      FTrace Reclaim Statistics: vmscan
      
                                      traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
      Direct reclaims                                443        273        513       1568
      Direct reclaim pages scanned                305968     280402     600825     957933
      Direct reclaim pages reclaimed               43503      19005      30327     117191
      Direct reclaim write file async I/O              0          0          0          0
      Direct reclaim write anon async I/O              0          3          4         12
      Direct reclaim write file sync I/O               0          0          0          0
      Direct reclaim write anon sync I/O               0          0          0          0
      Wake kswapd requests                        187649     132338     191695     267701
      Kswapd wakeups                                   3          1          4          1
      Kswapd pages scanned                       4599269    4454162    4296815    3891906
      Kswapd pages reclaimed                     2295947b    2428434    2399818    2319706
      Kswapd reclaim write file async I/O              1          0          1          1
      Kswapd reclaim write anon async I/O             59        187         41        222
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96
      Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19
      
      Total pages scanned                        4905237   4734564   4897640   4849839
      Total pages reclaimed                      2339450   2447439   2430145   2436897
      %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
      %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
      %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
      Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
      Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
      
      What is interesting here for nocongest in particular is that while direct
      reclaim scans more pages, the overall number of pages scanned remains the
      same and the ratio of pages scanned to pages reclaimed is more or less the
      same.  In other words, while we are sleeping less, reclaim is not doing
      more work and as direct reclaim and kswapd is awake for less time, it
      would appear to be doing less work.
      
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited                87        196         64          0
      Direct time   congest     waited            4604ms     4732ms     5420ms        0ms
      Direct full   congest     waited                72        145         53          0
      Direct number conditional waited                 0          0        324       1315
      Direct time   conditional waited               0ms        0ms        0ms        0ms
      Direct full   conditional waited                 0          0          0          0
      KSwapd number congest     waited                20         10         15          7
      KSwapd time   congest     waited            1264ms      536ms      884ms      284ms
      KSwapd full   congest     waited                10          4          6          2
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      
      The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
      all asleep with the patches.
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
      Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
      
      Overall, the tests completed faster. It is interesting to note that backing off further
      when a zone is congested and not just a BDI was more efficient overall.
      
      PPC64 micro-mapped-file-stream
      pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
      pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
      pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
      pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
      pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
      allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
      
      Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
      
      FTrace Reclaim Statistics: vmscan
      Direct reclaims                                977       2709       2098       5136
      Direct reclaim pages scanned                629825     963814    1063938    1711935
      Direct reclaim pages reclaimed               75550     242538     150904     387647
      Direct reclaim write file async I/O              0          0          0          2
      Direct reclaim write anon async I/O              0         10          0          4
      Direct reclaim write file sync I/O               0          0          0          0
      Direct reclaim write anon sync I/O               0          0          0          0
      Wake kswapd requests                        392119    1201712     571935     571921
      Kswapd wakeups                                   3          2          3          3
      Kswapd pages scanned                       4601307    4128076    3912317    3377165
      Kswapd pages reclaimed                     2432523    2318797    2312673    2144616
      Kswapd reclaim write file async I/O             20          1          1          1
      Kswapd reclaim write anon async I/O             57        132         11        121
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88
      Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90
      
      Total pages scanned                        5231132   5091890   4976255   5089100
      Total pages reclaimed                      2508073   2561335   2463577   2532263
      %age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
      %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
      %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
      Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
      Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%
      
      Again, a similar trend that the congestion_wait changes mean that direct
      reclaim scans more pages but the overall number of pages scanned while
      slightly reduced, are very similar.  The ratio of scanning/reclaimed
      remains roughly similar.  The downside is that kswapd and direct reclaim
      was awake longer and for a larger percentage of the overall workload.
      It's possible there were big differences in the amount of time spent
      reclaiming slab pages between the different kernels which is plausible
      considering that the micro tests runs after fsmark and sysbench.
      
      Trace Reclaim Statistics: congestion_wait
      Direct number congest     waited               845       1312        104          0
      Direct time   congest     waited           19416ms    26560ms     7544ms        0ms
      Direct full   congest     waited               745       1105         72          0
      Direct number conditional waited                 0          0       1322       2935
      Direct time   conditional waited               0ms        0ms       12ms      312ms
      Direct full   conditional waited                 0          0          0          3
      KSwapd number congest     waited                39        102         75         63
      KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms
      KSwapd full   congest     waited                20         48         46         25
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      
      The vanilla kernel spent 20 seconds asleep in direct reclaim and only
      312ms asleep with the patches.  The time kswapd spent congest waited was
      also reduced by a large factor.
      
      MMTests Statistics: duration
      ser/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
      Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88
      
      With all patches applies, the completion times are very similar.
      
      X86-64 STRESS-HIGHALLOC
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
      Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
      Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
      At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
      
      Success figures across the board are broadly similar.
      
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
      Direct reclaims                               1045        944        886        887
      Direct reclaim pages scanned                135091     119604     109382     101019
      Direct reclaim pages reclaimed               88599      47535      47863      46671
      Direct reclaim write file async I/O            494        283        465        280
      Direct reclaim write anon async I/O          29357      13710      16656      13462
      Direct reclaim write file sync I/O             154          2          2          3
      Direct reclaim write anon sync I/O           14594        571        509        561
      Wake kswapd requests                          7491        933        872        892
      Kswapd wakeups                                 814        778        731        780
      Kswapd pages scanned                       7290822   15341158   11916436   13703442
      Kswapd pages reclaimed                     3587336    3142496    3094392    3187151
      Kswapd reclaim write file async I/O          91975      32317      28022      29628
      Kswapd reclaim write anon async I/O        1992022     789307     829745     849769
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07
      Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82
      
      Total pages scanned                        7425913  15460762  12025818  13804461
      Total pages reclaimed                      3675935   3190031   3142255   3233822
      %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
      %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
      %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
      Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
      Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
      
      Scanned/reclaimed ratios again look good with big improvements in
      efficiency.  The Scanned/written ratios also look much improved.  With a
      better scanned/written ration, there is an expectation that IO would be
      more efficient and indeed, the time spent in direct reclaim is much
      reduced by the full series and kswapd spends a little less time awake.
      
      Overall, indications here are that allocations were happening much faster
      and this can be seen with a graph of the latency figures as the
      allocations were taking place
      http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
      
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited              1333        204        169          4
      Direct time   congest     waited           78896ms     8288ms     7260ms      200ms
      Direct full   congest     waited               756         92         69          2
      Direct number conditional waited                 0          0         26        186
      Direct time   conditional waited               0ms        0ms        0ms     2504ms
      Direct full   conditional waited                 0          0          0         25
      KSwapd number congest     waited                 4        395        227        282
      KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms
      KSwapd full   congest     waited                 3        232         98        176
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      KSwapd full   conditional waited               318          0        312          9
      
      Overall, the time spent speeping is reduced.  kswapd is still hitting
      congestion_wait() but that is because there are callers remaining where it
      wasn't clear in advance if they should be changed to wait_iff_congested()
      or not.  Overall the sleep imes are reduced though - from 79ish seconds to
      about 19.
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
      Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39
      
      With the full series, the time to complete the tests are reduced by 30%
      
      PPC64 STRESS-HIGHALLOC
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
      Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
      Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
      At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)
      
      Success rates there are *way* up particularly considering that the 16MB
      huge pages on PPC64 mean that it's always much harder to allocate them.
      
      FTrace Reclaim Statistics: vmscan
                    stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
      Direct reclaims                                499        505        564        509
      Direct reclaim pages scanned                223478      41898      51818      45605
      Direct reclaim pages reclaimed              137730      21148      27161      23455
      Direct reclaim write file async I/O            399        136        162        136
      Direct reclaim write anon async I/O          46977       2865       4686       3998
      Direct reclaim write file sync I/O              29          0          1          3
      Direct reclaim write anon sync I/O           31023        159        237        239
      Wake kswapd requests                           420        351        360        326
      Kswapd wakeups                                 185        294        249        277
      Kswapd pages scanned                      15703488   16392500   17821724   17598737
      Kswapd pages reclaimed                     5808466    2908858    3139386    3145435
      Kswapd reclaim write file async I/O         159938      18400      18717      13473
      Kswapd reclaim write anon async I/O        3467554     228957     322799     234278
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23
      Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88
      
      Total pages scanned                       15926966  16434398  17873542  17644342
      Total pages reclaimed                      5946196   2930006   3166547   3168890
      %age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
      %age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
      %age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
      Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
      Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%
      
      While the scanning rates are slightly up, the scanned/reclaimed and
      scanned/written figures are much improved.  The time spent in direct
      reclaim and with kswapd are massively reduced, mostly by the lowlumpy
      patches.
      
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited               725        303        126          3
      Direct time   congest     waited           45524ms     9180ms     5936ms      300ms
      Direct full   congest     waited               487        190         52          3
      Direct number conditional waited                 0          0        200        301
      Direct time   conditional waited               0ms        0ms        0ms     1904ms
      Direct full   conditional waited                 0          0          0         19
      KSwapd number congest     waited                 0          2         23          4
      KSwapd time   congest     waited               0ms      200ms      420ms      404ms
      KSwapd full   congest     waited                 0          2          2          4
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      
      Not as dramatic a story here but the time spent asleep is reduced and we
      can still see what wait_iff_congested is going to sleep when necessary.
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
      Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85
      
      The time to complete this test goes way down.  With the full series, we
      are allocating over twice the number of huge pages in 30% of the time and
      there is a corresponding impact on the allocation latency graph available
      at.
      
      http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
      
      This patch:
      
      Add a trace event for shrink_inactive_list() and updates the sample
      postprocessing script appropriately.  It can be used to determine how many
      pages were reclaimed and for non-lumpy reclaim where exactly the pages
      were reclaimed from.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e11da5b4
    • Shaohua Li's avatar
      vmscan: delete dead code · 66d9a986
      Shaohua Li authored
      `priority' cannot be negative here.  And the comment is obsolete.
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66d9a986
    • Will Deacon's avatar
      mm: fix typo in mm.h when NODE_NOT_IN_PAGE_FLAGS · bce54bbf
      Will Deacon authored
      NODE_NOT_IN_PAGE_FLAGS is defined in mm.h when the node information is not
      stored in the page flags bitmap.
      
      Unfortunately, there's a typo in one of the checks for it.  This patch
      fixes it (s/NODE_NOT_IN_PAGEFLAGS/NODE_NOT_IN_PAGE_FLAGS/).  Since this
      has been around for ages, I doubt it's been causing any serious problems.
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bce54bbf
    • Michael Rubin's avatar
      writeback: report dirty thresholds in /proc/vmstat · 79da826a
      Michael Rubin authored
      The kernel already exposes the user desired thresholds in /proc/sys/vm
      with dirty_background_ratio and background_ratio.  But the kernel may
      alter the number requested without giving the user any indication that is
      the case.
      
      Knowing the actual ratios the kernel is honoring can help app developers
      understand how their buffered IO will be sent to the disk.
      
              $ grep threshold /proc/vmstat
              nr_dirty_threshold 409111
              nr_dirty_background_threshold 818223
      Signed-off-by: default avatarMichael Rubin <mrubin@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79da826a
    • Michael Rubin's avatar
      writeback: add /sys/devices/system/node/<node>/vmstat · 2ac39037
      Michael Rubin authored
      For NUMA node systems it is important to have visibility in memory
      characteristics.  Two of the /proc/vmstat values "nr_written" and
      "nr_dirtied" are added here.
      
      	# cat /sys/devices/system/node/node20/vmstat
      	nr_written 0
      	nr_dirtied 0
      Signed-off-by: default avatarMichael Rubin <mrubin@google.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ac39037
    • Michael Rubin's avatar
      writeback: add nr_dirtied and nr_written to /proc/vmstat · ea941f0e
      Michael Rubin authored
      To help developers and applications gain visibility into writeback
      behaviour adding two entries to vm_stat_items and /proc/vmstat.  This will
      allow us to track the "written" and "dirtied" counts.
      
         # grep nr_dirtied /proc/vmstat
         nr_dirtied 3747
         # grep nr_written /proc/vmstat
         nr_written 3618
      Signed-off-by: default avatarMichael Rubin <mrubin@google.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea941f0e
    • Michael Rubin's avatar
      mm: add account_page_writeback() · f629d1c9
      Michael Rubin authored
      To help developers and applications gain visibility into writeback
      behaviour this patch adds two counters to /proc/vmstat.
      
        # grep nr_dirtied /proc/vmstat
        nr_dirtied 3747
        # grep nr_written /proc/vmstat
        nr_written 3618
      
      These entries allow user apps to understand writeback behaviour over time
      and learn how it is impacting their performance.  Currently there is no
      way to inspect dirty and writeback speed over time.  It's not possible for
      nr_dirty/nr_writeback.
      
      These entries are necessary to give visibility into writeback behaviour.
      We have /proc/diskstats which lets us understand the io in the block
      layer.  We have blktrace for more in depth understanding.  We have
      e2fsprogs and debugsfs to give insight into the file systems behaviour,
      but we don't offer our users the ability understand what writeback is
      doing.  There is no way to know how active it is over the whole system, if
      it's falling behind or to quantify it's efforts.  With these values
      exported users can easily see how much data applications are sending
      through writeback and also at what rates writeback is processing this
      data.  Comparing the rates of change between the two allow developers to
      see when writeback is not able to keep up with incoming traffic and the
      rate of dirty memory being sent to the IO back end.  This allows folks to
      understand their io workloads and track kernel issues.  Non kernel
      engineers at Google often use these counters to solve puzzling performance
      problems.
      
      Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written
      
      Patch #5 add writeback thresholds to /proc/vmstat
      
      Currently these values are in debugfs. But they should be promoted to
      /proc since they are useful for developers who are writing databases
      and file servers and are not debugging the kernel.
      
      The output is as below:
      
       # grep threshold /proc/vmstat
       nr_pages_dirty_threshold 409111
       nr_pages_dirty_background_threshold 818223
      
      This patch:
      
      This allows code outside of the mm core to safely manipulate page
      writeback state and not worry about the other accounting.  Not using these
      routines means that some code will lose track of the accounting and we get
      bugs.
      
      Modify nilfs2 to use interface.
      Signed-off-by: default avatarMichael Rubin <mrubin@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Jiro SEKIBA <jir@unicus.jp>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f629d1c9
    • Vasiliy Kulikov's avatar
      mm/mempolicy.c: check return code of check_range · 0def08e3
      Vasiliy Kulikov authored
      Function check_range may return ERR_PTR(...). Check for it.
      Signed-off-by: default avatarVasiliy Kulikov <segooon@gmail.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0def08e3
    • Minchan Kim's avatar
      vmscan: prevent background aging of anon page in no swap system · 74e3f3c3
      Minchan Kim authored
      Ying Han reported that backing aging of anon pages in no swap system
      causes unnecessary TLB flush.
      
      When I sent a patch(69c85481), I wanted this patch but Rik pointed out
      and allowed aging of anon pages to give a chance to promote from inactive
      to active LRU.
      
      It has a two problem.
      
      1) non-swap system
      
      Never make sense to age anon pages.
      
      2) swap configured but still doesn't swapon
      
      It doesn't make sense to age anon pages until swap-on time.  But it's
      arguable.  If we have aged anon pages by swapon, VM have moved anon pages
      from active to inactive.  And in the time swapon by admin, the VM can't
      reclaim hot pages so we can protect hot pages swapout.
      
      But let's think about it.  When does swap-on happen?  It depends on admin.
       we can't expect it.  Nonetheless, we have done aging of anon pages to
      protect hot pages swapout.  It means we lost run time overhead when below
      high watermark but gain hot page swap-[in/out] overhead when VM decide
      swapout.  Is it true?  Let's think more detail.  We don't promote anon
      pages in case of non-swap system.  So even though VM does aging of anon
      pages, the pages would be in inactive LRU for a long time.  It means many
      of pages in there would mark access bit again.  So access bit hot/code
      separation would be pointless.
      
      This patch prevents unnecessary anon pages demotion in not-yet-swapon and
      non-configured swap system.  Even, in non-configuared swap system
      inactive_anon_is_low can be compiled out.
      
      It could make side effect that hot anon pages could swap out when admin
      does swap on.  But I think sooner or later it would be steady state.  So
      it's not a big problem.
      
      We could lose someting but gain more thing(TLB flush and unnecessary
      function call to demote anon pages).
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74e3f3c3
    • KAMEZAWA Hiroyuki's avatar
      memory hotplug: unify is_removable and offline detection code · 49ac8255
      KAMEZAWA Hiroyuki authored
      Now, sysfs interface of memory hotplug shows whether the section is
      removable or not.  But it checks only migrateype of pages and doesn't
      check details of cluster of pages.
      
      Next, memory hotplug's set_migratetype_isolate() has the same kind of
      check, too.
      
      This patch adds the function __count_unmovable_pages() and makes above 2
      checks to use the same logic.  Then, is_removable and hotremove code uses
      the same logic.  No changes in the hotremove logic itself.
      
      TODO: need to find a way to check RECLAMABLE. But, considering bit,
            calling shrink_slab() against a range before starting memory hotremove
            sounds better. If so, this patch's logic doesn't need to be changed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reported-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49ac8255
    • KAMEZAWA Hiroyuki's avatar
      memory hotplug: fix notifier's return value check · 4b20477f
      KAMEZAWA Hiroyuki authored
      Even if notifier cannot find any pages, it doesn't mean no pages are
      available...And, if there are no notifiers registered, this condition will
      be always true and memory hotplug will show -EBUSY.
      
      This is a bug but not critical.
      
      In most case, a pageblock which will be offlined is MIGRATE_MOVABLE This
      "notifier" is called only when the pageblock is _not_ MIGRATE_MOVABLE.
      But if not MIGRATE_MOVABLE, it's common case that memory hotplug will
      fail.  So, no one notice this bug.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b20477f
    • Minchan Kim's avatar
      mm: compaction: fix COMPACTPAGEFAILED counting · cf608ac1
      Minchan Kim authored
      Presently update_nr_listpages() doesn't have a role.  That's because lists
      passed is always empty just after calling migrate_pages.  The
      migrate_pages cleans up page list which have failed to migrate before
      returning by aaa994b3.
      
       [PATCH] page migration: handle freeing of pages in migrate_pages()
      
       Do not leave pages on the lists passed to migrate_pages().  Seems that we will
       not need any postprocessing of pages.  This will simplify the handling of
       pages by the callers of migrate_pages().
      
      At that time, we thought we don't need any postprocessing of pages.  But
      the situation is changed.  The compaction need to know the number of
      failed to migrate for COMPACTPAGEFAILED stat
      
      This patch makes new rule for caller of migrate_pages to call
      putback_lru_pages.  So caller need to clean up the lists so it has a
      chance to postprocess the pages.  [suggested by Christoph Lameter]
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cf608ac1
    • Thadeu Lima de Souza Cascardo's avatar
      mm: only build per-node scan_unevictable functions when NUMA is enabled · e4455abb
      Thadeu Lima de Souza Cascardo authored
      Non-NUMA systems do never create these files anyway, since they are only
      created by driver subsystem when NUMA is configured.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4455abb
    • zeal's avatar
      include/linux/pageblock-flags.h: fix set_pageblock_flags() macro definiton · f19e77a3
      zeal authored
      The presently-unused macro was missing one parameter.
      Signed-off-by: default avatarzeal <zealcook@gmail.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f19e77a3
    • Wu Fengguang's avatar
      writeback: remove nonblocking/encountered_congestion references · 1b430bee
      Wu Fengguang authored
      This removes more dead code that was somehow missed by commit 0d99519e
      (writeback: remove unused nonblocking and congestion checks).  There are
      no behavior change except for the removal of two entries from one of the
      ext4 tracing interface.
      
      The nonblocking checks in ->writepages are no longer used because the
      flusher now prefer to block on get_request_wait() than to skip inodes on
      IO congestion.  The latter will lead to more seeky IO.
      
      The nonblocking checks in ->writepage are no longer used because it's
      redundant with the WB_SYNC_NONE check.
      
      We no long set ->nonblocking in VM page out and page migration, because
      a) it's effectively redundant with WB_SYNC_NONE in current code
      b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
         that would skip some dirty inodes on congestion and page out others, which
         is unfair in terms of LRU age.
      
      Inspired by Christoph Hellwig. Thanks!
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Sage Weil <sage@newdream.net>
      Cc: Steve French <sfrench@samba.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b430bee
    • David Rientjes's avatar
      oom: fix locking for oom_adj and oom_score_adj · d19d5476
      David Rientjes authored
      The locking order in oom_adjust_write() and oom_score_adj_write() for
      task->alloc_lock and task->sighand->siglock is reversed, and lockdep
      notices that irqs could encounter an ABBA scenario.
      
      This fixes the locking order so that we always take task_lock(task) prior
      to lock_task_sighand(task).
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d19d5476
    • David Rientjes's avatar
      oom: rewrite error handling for oom_adj and oom_score_adj tunables · 723548bf
      David Rientjes authored
      It's better to use proper error handling in oom_adjust_write() and
      oom_score_adj_write() instead of duplicating the locking order on various
      exit paths.
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      723548bf
    • David Rientjes's avatar
      oom: kill all threads sharing oom killed task's mm · 1e99bad0
      David Rientjes authored
      It's necessary to kill all threads that share an oom killed task's mm if
      the goal is to lead to future memory freeing.
      
      This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill
      doesn't kill vfork parent (or child)) since it is obsoleted.
      
      It's now guaranteed that any task passed to oom_kill_task() does not share
      an mm with any thread that is unkillable.  Thus, we're safe to issue a
      SIGKILL to any thread sharing the same mm.
      
      This is especially necessary to solve an mm->mmap_sem livelock issue
      whereas an oom killed thread must acquire the lock in the exit path while
      another thread is holding it in the page allocator while trying to
      allocate memory itself (and will preempt the oom killer since a task was
      already killed).  Since tasks with pending fatal signals are now granted
      access to memory reserves, the thread holding the lock may quickly
      allocate and release the lock so that the oom killed task may exit.
      
      This mainly is for threads that are cloned with CLONE_VM but not
      CLONE_THREAD, so they are in a different thread group.  Non-NPTL threads
      exist in the wild and this change is necessary to prevent the livelock in
      such cases.  We care more about preventing the livelock than incurring the
      additional tasklist in the oom killer when a task has been killed.
      Systems that are sufficiently large to not want the tasklist scan in the
      oom killer in the first place already have the option of enabling
      /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for
      that purpose.
      
      This code had existed in the oom killer for over eight years dating back
      to the 2.4 kernel.
      
      [akpm@linux-foundation.org: add nice comment]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e99bad0
    • David Rientjes's avatar
      oom: avoid killing a task if a thread sharing its mm cannot be killed · e18641e1
      David Rientjes authored
      The oom killer's goal is to kill a memory-hogging task so that it may
      exit, free its memory, and allow the current context to allocate the
      memory that triggered it in the first place.  Thus, killing a task is
      pointless if other threads sharing its mm cannot be killed because of its
      /proc/pid/oom_adj or /proc/pid/oom_score_adj value.
      
      This patch checks whether any other thread sharing p->mm has an
      oom_score_adj of OOM_SCORE_ADJ_MIN.  If so, the thread cannot be killed
      and oom_badness(p) returns 0, meaning it's unkillable.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e18641e1
    • Ying Han's avatar
      oom: add per-mm oom disable count · 3d5992d2
      Ying Han authored
      It's pointless to kill a task if another thread sharing its mm cannot be
      killed to allow future memory freeing.  A subsequent patch will prevent
      kills in such cases, but first it's necessary to have a way to flag a task
      that shares memory with an OOM_DISABLE task that doesn't incur an
      additional tasklist scan, which would make select_bad_process() an O(n^2)
      function.
      
      This patch adds an atomic counter to struct mm_struct that follows how
      many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
      They cannot be killed by the kernel, so their memory cannot be freed in
      oom conditions.
      
      This only requires task_lock() on the task that we're operating on, it
      does not require mm->mmap_sem since task_lock() pins the mm and the
      operation is atomic.
      
      [rientjes@google.com: changelog and sys_unshare() code]
      [rientjes@google.com: protect oom_disable_count with task_lock in fork]
      [rientjes@google.com: use old_mm for oom_disable_count in exec]
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d5992d2
    • Matt Mackall's avatar
      Documentation/filesystems/proc.txt: improve smaps field documentation · 0f4d208f
      Matt Mackall authored
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f4d208f
    • WANG Cong's avatar
      vmcore: it is not experimental any more · a4f7326d
      WANG Cong authored
      We use vmcore in our production kernel for a long time, it is pretty
      stable now.  So I don't think we need to mark it as experimental any more.
      Signed-off-by: default avatarWANG Cong <xiyou.wangcong@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4f7326d
    • Richard Weinberger's avatar
      um: fix IRQ flag handling naming · dbec9213
      Richard Weinberger authored
      Commit df9ee292 ("Fix IRQ flag handling naming") changed the IRQ flag
      handling naming scheme and broke UML:
      
      In file included from arch/um/include/asm/fixmap.h:5,
                       from arch/um/include/shared/um_uaccess.h:10,
                       from arch/um/include/asm/uaccess.h:41,
                       from arch/um/include/asm/thread_info.h:13,
                       from include/linux/thread_info.h:56,
                       from include/linux/preempt.h:9,
                       from include/linux/spinlock.h:50,
                       from include/linux/seqlock.h:29,
                       from include/linux/time.h:8,
                       from include/linux/stat.h:60,
                       from include/linux/module.h:10,
                       from init/main.c:13:
      arch/um/include/asm/system.h:11:1: warning: "local_save_flags" redefined
      
      This patch brings the new scheme to UML and makes it work again.
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dbec9213
    • Masanori ITOH's avatar
      percpu: fix list_head init bug in __percpu_counter_init() · 8474b591
      Masanori ITOH authored
      WARNING: at lib/list_debug.c:26 __list_add+0x3f/0x81()
      Hardware name: Express5800/B120a [N8400-085]
      list_add corruption. next->prev should be prev (ffffffff81a7ea00), but was dead000000200200. (next=ffff88080b872d58).
      Modules linked in: aoe ipt_MASQUERADE iptable_nat nf_nat autofs4 sunrpc bridge 8021q garp stp llc ipv6 cpufreq_ondemand acpi_cpufreq freq_table dm_round_robin dm_multipath kvm_intel kvm uinput lpfc scsi_transport_fc igb ioatdma scsi_tgt i2c_i801 i2c_core dca iTCO_wdt iTCO_vendor_support pcspkr shpchp megaraid_sas [last unloaded: aoe]
      Pid: 54, comm: events/3 Tainted: G        W  2.6.34-vanilla1 #1
      Call Trace:
      [<ffffffff8104bd77>] warn_slowpath_common+0x7c/0x94
      [<ffffffff8104bde6>] warn_slowpath_fmt+0x41/0x43
      [<ffffffff8120fd2e>] __list_add+0x3f/0x81
      [<ffffffff81212a12>] __percpu_counter_init+0x59/0x6b
      [<ffffffff810d8499>] bdi_init+0x118/0x17e
      [<ffffffff811f2c50>] blk_alloc_queue_node+0x79/0x143
      [<ffffffff811f2d2b>] blk_alloc_queue+0x11/0x13
      [<ffffffffa02a931d>] aoeblk_gdalloc+0x8e/0x1c9 [aoe]
      [<ffffffffa02aa655>] aoecmd_sleepwork+0x25/0xa8 [aoe]
      [<ffffffff8106186c>] worker_thread+0x1a9/0x237
      [<ffffffffa02aa630>] ? aoecmd_sleepwork+0x0/0xa8 [aoe]
      [<ffffffff81065827>] ? autoremove_wake_function+0x0/0x39
      [<ffffffff810616c3>] ? worker_thread+0x0/0x237
      [<ffffffff810653ad>] kthread+0x7f/0x87
      [<ffffffff8100aa24>] kernel_thread_helper+0x4/0x10
      [<ffffffff8106532e>] ? kthread+0x0/0x87
      [<ffffffff8100aa20>] ? kernel_thread_helper+0x0/0x10
      
      It's because there is no initialization code for a list_head contained in
      the struct backing_dev_info under CONFIG_HOTPLUG_CPU, and the bug comes up
      when block device drivers calling blk_alloc_queue() are used.  In case of
      me, I got them by using aoe.
      Signed-off-by: default avatarMasanori Itoh <itoumsn@nttdata.co.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8474b591
    • Andrew Morton's avatar
      kfifo: disable __kfifo_must_check_helper() · 52c51712
      Andrew Morton authored
      This helper is wrong: it coerces signed values into unsigned ones, so code
      such as
      
      	if (kfifo_alloc(...) < 0) {
      		error
      	}
      
      will fail to detect the error.
      
      So let's disable __kfifo_must_check_helper() for 2.6.36.
      
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52c51712
    • Richard Weinberger's avatar
      hostfs: fix UML crash: remove f_spare from hostfs · 1b627d57
      Richard Weinberger authored
      365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
      struct statfs which caused a UML crash.  There is no need to copy f_spare.
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Reported-by: default avatarToralf Förster <toralf.foerster@gmx.de>
      Tested-by: default avatarToralf Förster <toralf.foerster@gmx.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b627d57
    • Eric Dumazet's avatar
      ipmi: proper spinlock initialization · de5e2ddf
      Eric Dumazet authored
      Unloading ipmi module can trigger following error.  (if
      CONFIG_DEBUG_SPINLOCK=y)
      
      [ 9633.779590] BUG: spinlock bad magic on CPU#1, rmmod/7170
      [ 9633.779606]  lock: f41f5414, .magic: 00000000, .owner:
      <none>/-1, .owner_cpu: 0
      [ 9633.779626] Pid: 7170, comm: rmmod Not tainted
      2.6.36-rc7-11474-gb71eb1e-dirty #328
      [ 9633.779644] Call Trace:
      [ 9633.779657]  [<c13921cc>] ? printk+0x18/0x1c
      [ 9633.779672]  [<c11a1f33>] spin_bug+0xa3/0xf0
      [ 9633.779685]  [<c11a1ffd>] do_raw_spin_lock+0x7d/0x160
      [ 9633.779702]  [<c1131537>] ? release_sysfs_dirent+0x47/0xb0
      [ 9633.779718]  [<c1131b78>] ? sysfs_addrm_finish+0xa8/0xd0
      [ 9633.779734]  [<c1394bac>] _raw_spin_lock_irqsave+0xc/0x20
      [ 9633.779752]  [<f99d93da>] cleanup_one_si+0x6a/0x200 [ipmi_si]
      [ 9633.779768]  [<c11305b2>] ? sysfs_hash_and_remove+0x72/0x80
      [ 9633.779786]  [<f99dcf26>] ipmi_pnp_remove+0xd/0xf [ipmi_si]
      [ 9633.779802]  [<c11f622b>] pnp_device_remove+0x1b/0x40
      
      Fix this by initializing spinlocks in a smi_info_alloc() helper function,
      right after memory allocation and clearing.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarDavid Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: default avatarCorey Minyard <cminyard@mvista.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de5e2ddf
    • Michael Hennerich's avatar
      drivers/misc/ad525x_dpot.c: fix typo in spi write16 and write24 transfer counts · 1f9fa521
      Michael Hennerich authored
      This is a bug fix.  Some SPI connected devices using 16/24 bit accesses,
      previously failed, now work.
      
      This typo slipped in after testing, during some restructuring.
      Signed-off-by: default avatarMichael Hennerich <michael.hennerich@analog.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Chris Verges <chrisv@cyberswitching.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f9fa521
    • Richard Weinberger's avatar
      um: remove PAGE_SIZE alignment in linker script causing kernel segfault. · 6915e04f
      Richard Weinberger authored
      The linker script cleanup that I did in commit 5d150a97 ("um: Clean up
      linker script using standard macros.") (2.6.32) accidentally introduced an
      ALIGN(PAGE_SIZE) when converting to use INIT_TEXT_SECTION; Richard
      Weinberger reported that this causes the kernel to segfault with
      CONFIG_STATIC_LINK=y.
      
      I'm not certain why this extra alignment is a problem, but it seems likely
      it is because previously
      
      __init_begin = _stext = _text = _sinittext
      
      and with the extra ALIGN(PAGE_SIZE), _sinittext becomes different from the
      rest.  So there is likely a bug here where something is assuming that
      _sinittext is the same as one of those other symbols.  But reverting the
      accidental change fixes the regression, so it seems worth committing that
      now.
      Signed-off-by: default avatarTim Abbott <tabbott@ksplice.com>
      Reported-by: default avatarRichard Weinberger <richard@nod.at>
      Cc: Jeff Dike <jdike@addtoit.com>
      Tested by: Antoine Martin <antoine@nagafix.co.uk>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6915e04f
    • Robin Holt's avatar
      sgi-xp: incoming XPC channel messages can come in after the channel's... · 09358972
      Robin Holt authored
      sgi-xp: incoming XPC channel messages can come in after the channel's partition structures have been torn down
      
      Under some workloads, some channel messages have been observed being
      delayed on the sending side past the point where the receiving side has
      been able to tear down its partition structures.
      
      This condition is already detected in xpc_handle_activate_IRQ_uv(), but
      that information is not given to xpc_handle_activate_mq_msg_uv().  As a
      result, xpc_handle_activate_mq_msg_uv() assumes the structures still exist
      and references them, causing a NULL-pointer deref.
      Signed-off-by: default avatarRobin Holt <holt@sgi.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09358972
    • Richard Weinberger's avatar
      um: fix global timer issue when using CONFIG_NO_HZ · 482db6df
      Richard Weinberger authored
      This fixes a issue which was introduced by fe2cc53e ("uml: track and make
      up lost ticks").
      
      timeval_to_ns() returns long long and not int.  Due to that UML's timer
      did not work properlt and caused timer freezes.
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      482db6df
    • Mel Gorman's avatar
      mm, page-allocator: do not check the state of a non-existant buddy during free · b7f50cfa
      Mel Gorman authored
      There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
      in buddy allocator by adding buddies that are merging to the tail of the
      free lists") that means a buddy at order MAX_ORDER is checked for merging.
       A page of this order never exists so at times, an effectively random
      piece of memory is being checked.
      
      Alan Curry has reported that this is causing memory corruption in
      userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
      It is not clear why this is happening.  It could be a cache coherency
      problem where pages mapped in both user and kernel space are getting
      different cache lines due to the bad read from kernel space
      (http://lkml.org/lkml/2010/10/13/179).  It could also be that there are
      some special registers being io-remapped at the end of the memmap array
      and that a read has special meaning on them.  Compiler bugs have been
      ruled out because the assembly before and after the patch looks relatively
      harmless.
      
      This patch fixes the problem by ensuring we are not reading a possibly
      invalid location of memory.  It's not clear why the read causes corruption
      but one way or the other it is a buggy read.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Reported-by: default avatarAlan Curry <pacman@kosh.dhis.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b7f50cfa
    • Andrew Morton's avatar
      types.h: move misplaced comment · a75d3776
      Andrew Morton authored
      This comment landed in the wrong place.
      
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a75d3776
    • KAMEZAWA Hiroyuki's avatar
      mm: fix return value of scan_lru_pages in memory unplug · f8f72ad5
      KAMEZAWA Hiroyuki authored
      scan_lru_pages returns pfn. So, it's type should be "unsigned long"
      not "int".
      
      Note: I guess this has been work until now because memory hotplug tester's
            machine has not very big memory....
            physical address < 32bit << PAGE_SHIFT.
      Reported-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f8f72ad5
    • Linus Torvalds's avatar
      Merge git://git.infradead.org/battery-2.6 · 45352bbf
      Linus Torvalds authored
      * git://git.infradead.org/battery-2.6:
        power_supply: Makefile cleanup
        bq27x00_battery: Add missing kfree(di->bus) in bq27x00_battery_remove()
        power_supply: Introduce maximum current property
        power_supply: Add types for USB chargers
        ds2782_battery: Fix units
        power_supply: Add driver for TWL4030/TPS65950 BCI charger
        bq20z75: Add support for more power supply properties
        wm831x_power: Add missing kfree(wm831x_power) in wm831x_power_remove()
        jz4740-battery: Add missing kfree(jz_battery) in jz_battery_remove()
        ds2760_battery: Add missing kfree(di) in ds2760_battery_remove()
        olpc_battery: Fix endian neutral breakage for s16 values
        ds2760_battery: Fix W1 and W1_SLAVE_DS2760 dependency
        pcf50633-charger: Add missing sysfs_remove_group()
        power_supply: Add driver for TI BQ20Z75 gas gauge IC
        wm831x_power: Remove duplicate chg mask
        omap: rx51: Add support for USB chargers
        power_supply: Add isp1704 charger detection driver
      45352bbf
    • Linus Torvalds's avatar
      Merge branch 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core · da62aa69
      Linus Torvalds authored
      * 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core: (34 commits)
        i7core_edac: return -ENODEV when devices were already probed
        i7core_edac: properly terminate pci_dev_table
        i7core_edac: Avoid PCI refcount to reach zero on successive load/reload
        i7core_edac: Fix refcount error at PCI devices
        i7core_edac: it is safe to i7core_unregister_mci() when mci=NULL
        i7core_edac: Fix an oops at i7core probe
        i7core_edac: Remove unused member channels in i7core_pvt
        i7core_edac: Remove unused arg csrow from get_dimm_config
        i7core_edac: Reduce args of i7core_register_mci
        i7core_edac: Introduce i7core_unregister_mci
        i7core_edac: Use saved pointers
        i7core_edac: Check probe counter in i7core_remove
        i7core_edac: Call pci_dev_put() when alloc_i7core_dev()  failed
        i7core_edac: Fix error path of i7core_register_mci
        i7core_edac: Fix order of lines in i7core_register_mci
        i7core_edac: Always do get/put for all devices
        i7core_edac: Introduce i7core_pci_ctl_create/release
        i7core_edac: Introduce free_i7core_dev
        i7core_edac: Introduce alloc_i7core_dev
        i7core_edac: Reduce args of i7core_get_onedevice
        ...
      da62aa69
    • Linus Torvalds's avatar
      Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 · f1ebdd60
      Linus Torvalds authored
      * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (22 commits)
        Add _addr_lsb field to ia64 siginfo
        Fix migration.c compilation on s390
        HWPOISON: Remove retry loop for try_to_unmap
        HWPOISON: Turn addr_valid from bitfield into char
        HWPOISON: Disable DEBUG by default
        HWPOISON: Convert pr_debugs to pr_info
        HWPOISON: Improve comments in memory-failure.c
        x86: HWPOISON: Report correct address granuality for huge hwpoison faults
        Encode huge page size for VM_FAULT_HWPOISON errors
        Fix build error with !CONFIG_MIGRATION
        hugepage: move is_hugepage_on_freelist inside ifdef to avoid warning
        Clean up __page_set_anon_rmap
        HWPOISON, hugetlb: fix unpoison for hugepage
        HWPOISON, hugetlb: soft offlining for hugepage
        HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED
        hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
        HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()
        hugetlb: hugepage migration core
        hugetlb: redefine hugepage copy functions
        hugetlb: add allocate function for hugepage migration
        ...
      f1ebdd60
    • Linus Torvalds's avatar
      Merge branch 'for_linus' of git://github.com/at91linux/linux-2.6-at91 · f99d0553
      Linus Torvalds authored
      * 'for_linus' of git://github.com/at91linux/linux-2.6-at91:
        AT91: rtc: enable built-in RTC in Kconfig for at91sam9g45 family
        at91/atmel-mci: inclusion of sd/mmc driver in at91sam9g45 chip and board
        AT91: pm: make sure that r0 is 0 when dealing with cache operations
        AT91: pm: use plain cpu_do_idle() for "wait for interrupt"
        AT91: reset: extend alternate reset procedure to several chips
        AT91: reset routine cleanup, remove not needed icache flush
        AT91: trivial: align comment of at91sam9g20_reset with one more tab
        AT91: Fix AT91SAM9G20 reset as per the errata in the data sheet
        AT91: add board support for Pcontrol_G20
      f99d0553