1. 25 May, 2010 5 commits
    • page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists · 6dda9d55
      Corrado Zoccolo authored
      
      In order to reduce fragmentation, this patch classifies freed pages in two
      groups according to their probability of being part of a high order merge.
       Pages belonging to a compound whose next-highest buddy is free are more
      likely to be part of a high order merge in the near future, so they will
      be added at the tail of the freelist.  The remaining pages are put at the
      front of the freelist.
      
      In this way, the pages that are more likely to cause a big merge are kept
      free longer.  Consequently there is a tendency to aggregate the
      long-living allocations on a subset of the compounds, reducing the
      fragmentation.
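
      The placement rule described above can be sketched with a toy model.
      This is a simplified illustration, not the kernel code: the class
      ToyBuddy, its methods and the small MAX_ORDER value are all
      hypothetical, and allocation does no block splitting.  A freed block
      first merges with its free buddies; the resulting block goes to the
      tail of its free list only if the buddy of its parent block is
      already free, so that one more free would trigger a larger merge.

```python
from collections import deque

MAX_ORDER = 4  # toy value (assumption): block orders range over 0..MAX_ORDER-1


class ToyBuddy:
    """Toy buddy free lists; blocks are named by the index of their first page."""

    def __init__(self):
        self.free_lists = [deque() for _ in range(MAX_ORDER)]
        self.free_order = {}  # first page index -> order of that free block

    def _is_free(self, idx, order):
        return self.free_order.get(idx) == order

    def free(self, idx, order):
        """Free a block; return True if it was queued at the tail."""
        # Merge upward for as long as the same-order buddy is free.
        while order < MAX_ORDER - 1:
            buddy = idx ^ (1 << order)
            if not self._is_free(buddy, order):
                break
            self.free_lists[order].remove(buddy)
            del self.free_order[buddy]
            idx &= ~(1 << order)  # start of the merged block
            order += 1

        # Heuristic: is the buddy of the parent (order + 1) block free?
        to_tail = False
        if order < MAX_ORDER - 2:
            parent = idx & ~(1 << order)
            higher_buddy = parent ^ (1 << (order + 1))
            to_tail = self._is_free(higher_buddy, order + 1)

        if to_tail:
            self.free_lists[order].append(idx)      # keep it free longer
        else:
            self.free_lists[order].appendleft(idx)  # reuse it first
        self.free_order[idx] = order
        return to_tail

    def alloc(self, order):
        """Take the block at the head of the list (no splitting in this toy)."""
        idx = self.free_lists[order].popleft()
        del self.free_order[idx]
        return idx


if __name__ == "__main__":
    b = ToyBuddy()
    b.free(2, 1)         # pages 2-3 free as one order-1 block
    b.free(5, 0)         # unrelated page, goes to the head
    print(b.free(0, 0))  # parent's buddy (pages 2-3) is free, so tail
    print(b.alloc(0))    # the allocator hands out page 5 before page 0
```

      In this toy run page 0 survives the next order-0 allocation, so a
      later free of page 1 can merge pages 0-3 into a single order-2 block.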
      
      This heuristic was tested on three machines, x86, x86-64 and ppc64 with
      3GB of RAM in each machine.  The tests were kernbench, netperf, sysbench
      and STREAM for performance and a high-order stress test for huge page
      allocations.
      
      KernBench X86
      Elapsed mean     374.77 ( 0.00%)   375.10 (-0.09%)
      User    mean     649.53 ( 0.00%)   650.44 (-0.14%)
      System  mean      54.75 ( 0.00%)    54.18 ( 1.05%)
      CPU     mean     187.75 ( 0.00%)   187.25 ( 0.27%)
      
      KernBench X86-64
      Elapsed mean      94.45 ( 0.00%)    94.01 ( 0.47%)
      User    mean     323.27 ( 0.00%)   322.66 ( 0.19%)
      System  mean      36.71 ( 0.00%)    36.50 ( 0.57%)
      CPU     mean     380.75 ( 0.00%)   381.75 (-0.26%)
      
      KernBench PPC64
      Elapsed mean     173.45 ( 0.00%)   173.74 (-0.17%)
      User    mean     587.99 ( 0.00%)   587.95 ( 0.01%)
      System  mean      60.60 ( 0.00%)    60.57 ( 0.05%)
      CPU     mean     373.50 ( 0.00%)   372.75 ( 0.20%)
      
      Nothing notable for kernbench.
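
      For readers checking the tables, the percentages in parentheses look
      like relative gains of the patched kernel over the baseline.  The
      sketch below is an inference from the published numbers, not
      something the author states: the figures are reproduced when the
      difference is divided by the patched value.  The function name and
      the lower_is_better flag are hypothetical.

```python
def gain_percent(baseline, patched, lower_is_better):
    """Relative gain of `patched` over `baseline`, in percent.

    Assumption inferred from the tables: the difference is divided by the
    patched value.  Positive means the patched kernel did better.
    """
    if lower_is_better:
        delta = baseline - patched   # time-style metrics
    else:
        delta = patched - baseline   # throughput-style metrics
    return 100.0 * delta / patched


if __name__ == "__main__":
    # KernBench X86, System mean: 54.75 (baseline) vs 54.18 (patched)
    print(round(gain_percent(54.75, 54.18, lower_is_better=True), 2))
```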
      
      NetPerf UDP X86
            64    42.68 ( 0.00%)     42.77 ( 0.21%)
           128    85.62 ( 0.00%)     85.32 (-0.35%)
           256   170.01 ( 0.00%)    168.76 (-0.74%)
          1024   655.68 ( 0.00%)    652.33 (-0.51%)
          2048  1262.39 ( 0.00%)   1248.61 (-1.10%)
          3312  1958.41 ( 0.00%)   1944.61 (-0.71%)
          4096  2345.63 ( 0.00%)   2318.83 (-1.16%)
          8192  4132.90 ( 0.00%)   4089.50 (-1.06%)
         16384  6770.88 ( 0.00%)   6642.05 (-1.94%)*
      
      NetPerf UDP X86-64
            64   148.82 ( 0.00%)    154.92 ( 3.94%)
           128   298.96 ( 0.00%)    312.95 ( 4.47%)
           256   583.67 ( 0.00%)    626.39 ( 6.82%)
          1024  2293.18 ( 0.00%)   2371.10 ( 3.29%)
          2048  4274.16 ( 0.00%)   4396.83 ( 2.79%)
          3312  6356.94 ( 0.00%)   6571.35 ( 3.26%)
          4096  7422.68 ( 0.00%)   7635.42 ( 2.79%)*
          8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
         16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
                   1.64%             2.73%
      
      NetPerf UDP PPC64
            64    49.98 ( 0.00%)     50.25 ( 0.54%)
           128    98.66 ( 0.00%)    100.95 ( 2.27%)
           256   197.33 ( 0.00%)    191.03 (-3.30%)
          1024   761.98 ( 0.00%)    785.07 ( 2.94%)
          2048  1493.50 ( 0.00%)   1510.85 ( 1.15%)
          3312  2303.95 ( 0.00%)   2271.72 (-1.42%)
          4096  2774.56 ( 0.00%)   2773.06 (-0.05%)
          8192  4918.31 ( 0.00%)   4793.59 (-2.60%)
         16384  7497.98 ( 0.00%)   7749.52 ( 3.25%)
      
      Each test is run until the confidence limits are within 1%.  Results
      marked with a * did not reach that confidence level, although in this
      case they are outside it only by small amounts.  Even with some results
      that were not confident, the netperf UDP results were generally positive.
      
      NetPerf TCP X86
            64   652.25 ( 0.00%)*   648.12 (-0.64%)*
                  23.80%            22.82%
           128  1229.98 ( 0.00%)*  1220.56 (-0.77%)*
                  21.03%            18.90%
           256  2105.88 ( 0.00%)   1872.03 (-12.49%)*
                   1.00%            16.46%
          1024  3476.46 ( 0.00%)*  3548.28 ( 2.02%)*
                  13.37%            11.39%
          2048  4023.44 ( 0.00%)*  4231.45 ( 4.92%)*
                   9.76%            12.48%
          3312  4348.88 ( 0.00%)*  4396.96 ( 1.09%)*
                   6.49%             8.75%
          4096  4726.56 ( 0.00%)*  4877.71 ( 3.10%)*
                   9.85%             8.50%
          8192  4732.28 ( 0.00%)*  5777.77 (18.10%)*
                   9.13%            13.04%
         16384  5543.05 ( 0.00%)*  5906.24 ( 6.15%)*
                   7.73%             8.68%
      
      NETPERF TCP X86-64
                  tcp-vanilla       pgalloc-delay
            64  1895.87 ( 0.00%)*  1775.07 (-6.81%)*
                   5.79%             4.78%
           128  3571.03 ( 0.00%)*  3342.20 (-6.85%)*
                   3.68%             6.06%
           256  5097.21 ( 0.00%)*  4859.43 (-4.89%)*
                   3.02%             2.10%
          1024  8919.10 ( 0.00%)*  8892.49 (-0.30%)*
                   5.89%             6.55%
          2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
                   7.08%             7.44%
          3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
                   6.87%             7.33%
          4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
                   6.86%             8.18%
          8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
                   7.49%             5.55%
         16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
                   7.36%             6.49%
      
      NETPERF TCP PPC64
                  tcp-vanilla       pgalloc-delay
            64   594.17 ( 0.00%)    596.04 ( 0.31%)*
                   1.00%             2.29%
           128  1064.87 ( 0.00%)*  1074.77 ( 0.92%)*
                   1.30%             1.40%
           256  1852.46 ( 0.00%)*  1856.95 ( 0.24%)
                   1.25%             1.00%
          1024  3839.46 ( 0.00%)*  3813.05 (-0.69%)
                   1.02%             1.00%
          2048  4885.04 ( 0.00%)*  4881.97 (-0.06%)*
                   1.15%             1.04%
          3312  5506.90 ( 0.00%)   5459.72 (-0.86%)
          4096  6449.19 ( 0.00%)   6345.46 (-1.63%)
          8192  7501.17 ( 0.00%)   7508.79 ( 0.10%)
         16384  9618.65 ( 0.00%)   9490.10 (-1.35%)
      
      There was a distinct lack of confidence in the X86* figures, so I included
      the deviation where the results were not confident.  Many of the
      results, whether gains or losses, were within the standard deviation, so no
      solid conclusion can be reached on performance impact.  Looking at the
      figures, only the X86-64 ones look suspicious, with a few losses that were
      outside the noise.  However, the results were so unstable that without
      knowing why they vary so much, a solid conclusion cannot be reached.
      
      SYSBENCH X86
                    sysbench-vanilla     pgalloc-delay
                 1  7722.85 ( 0.00%)  7756.79 ( 0.44%)
                 2 14901.11 ( 0.00%) 13683.44 (-8.90%)
                 3 15171.71 ( 0.00%) 14888.25 (-1.90%)
                 4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
                 5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
                 6 14870.33 ( 0.00%) 14845.57 (-0.17%)
                 7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
                 8 14354.35 ( 0.00%) 14362.31 ( 0.06%)
      
      SYSBENCH X86-64
                 1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
                 2 34276.39 ( 0.00%) 34251.00 (-0.07%)
                 3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
                 4 66667.10 ( 0.00%) 66174.69 (-0.74%)
                 5 66003.91 ( 0.00%) 65685.25 (-0.49%)
                 6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
                 7 64933.16 ( 0.00%) 64379.23 (-0.86%)
                 8 63353.30 ( 0.00%) 63281.22 (-0.11%)
                 9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
                10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
                11 62092.81 ( 0.00%) 61787.75 (-0.49%)
                12 61330.11 ( 0.00%) 61036.34 (-0.48%)
                13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
                14 62304.48 ( 0.00%) 62064.90 (-0.39%)
                15 63296.48 ( 0.00%) 62875.16 (-0.67%)
                16 63951.76 ( 0.00%) 63769.09 (-0.29%)
      
      SYSBENCH PPC64
                    sysbench-vanilla     pgalloc-delay
                 1  7645.08 ( 0.00%)  7467.43 (-2.38%)
                 2 14856.67 ( 0.00%) 14558.73 (-2.05%)
                 3 21952.31 ( 0.00%) 21683.64 (-1.24%)
                 4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
                 5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
                 6 27477.10 ( 0.00%) 27337.45 (-0.51%)
                 7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
                 8 26642.91 ( 0.00%) 25274.33 (-5.41%)
                 9 25137.27 ( 0.00%) 24810.06 (-1.32%)
                10 24451.99 ( 0.00%) 24275.85 (-0.73%)
                11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
                12 24234.81 ( 0.00%) 23640.89 (-2.51%)
                13 24577.75 ( 0.00%) 24433.50 (-0.59%)
                14 25640.19 ( 0.00%) 25116.52 (-2.08%)
                15 26188.84 ( 0.00%) 26181.36 (-0.03%)
                16 26782.37 ( 0.00%) 26255.99 (-2.00%)
      
      Again, there is little to conclude here.  While there are a few losses,
      the results vary by +/- 8% in some cases.  These are the results of most
      concern as there are some large losses, but they are also within the
      variance typically seen between kernel releases.
      
      The STREAM results varied so little and are so verbose that I didn't
      include them here.
      
      The final test stressed how many huge pages can be allocated.  The
      absolute number of huge pages allocated is the same with or without the
      patch.  However, the "unusability free space index", which is a measure of
      external fragmentation, was slightly lower (lower is better) throughout the
      lifetime of the system.  I also measured the latency of how long it took
      to successfully allocate a huge page.  The latency was slightly lower and,
      on X86 and PPC64, more huge pages were allocated almost immediately from
      the free lists.  The improvement is slight but there.
      
      [mel@csn.ul.ie: Tested, reworked for less branches]
      [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: insert tmpfs cache pages to inactive list at first · e9d6c157
      KOSAKI Motohiro authored
      Shaohua Li reported that a parallel file copy on tmpfs can trigger the
      OOM killer.  This is a regression caused by commit 9ff473b9 ("vmscan:
      evict streaming IO first").  Wow, it is a 2-year-old patch!
      
      Currently, tmpfs file cache pages are inserted into the active list first.
      This means that the insertion not only increases the number of pages in
      the anon LRU, it also reduces the anon scanning ratio.  As a result,
      vmscan gets totally confused: it scans almost only the file LRU even
      though the system has plenty of unused tmpfs pages.
      
      Historically, lru_cache_add_active_anon() was used for two reasons:
      1) To prioritize shmem pages over regular file cache.
      2) To avoid reclaim priority inversion of used-once pages.
      
      But both motivations have been lost: (1) We now have separate anon and
      file LRU lists, so inserting into the active list no longer provides such
      prioritization.  (2) In the past, a single pte access bit caused page
      activation, so inserting into the inactive list with the pte access bit
      set meant higher priority than inserting into the active list.  This
      priority inversion could lead to unintended LRU churn, but it was already
      solved by commit 64574746 ("vmscan: detect mapped file pages used only
      once").  (Thanks Hannes, you are great!)
      
      Thus, we can now use lru_cache_add_anon() instead.
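
      The effect of the insertion point can be illustrated with a toy
      two-list LRU.  This is a deliberately simplified sketch and not the
      kernel's vmscan logic: the class TwoListLRU and its methods are
      hypothetical, reclaim only looks at the inactive list, and the
      active-list aging and scan-ratio accounting discussed above are left
      out.

```python
from collections import deque


class TwoListLRU:
    """Toy active/inactive LRU pair; the head is the left end of each deque."""

    def __init__(self):
        self.active = deque()
        self.inactive = deque()

    def add(self, page, to_active=False):
        """Insert a new page at the head of the chosen list."""
        target = self.active if to_active else self.inactive
        target.appendleft(page)

    def touch(self, page):
        """A repeated reference promotes an inactive page to the active list."""
        if page in self.inactive:
            self.inactive.remove(page)
            self.active.appendleft(page)

    def reclaim(self, count):
        """Reclaim up to `count` pages from the tail of the inactive list."""
        reclaimed = []
        while len(reclaimed) < count and self.inactive:
            reclaimed.append(self.inactive.pop())
        return reclaimed


if __name__ == "__main__":
    lru = TwoListLRU()
    for page in ("a", "b", "c"):
        lru.add(page)        # new pages start on the inactive list
    lru.touch("b")           # only "b" is used a second time
    print(lru.reclaim(2))    # the used-once pages are reclaimed first
```

      In this toy, pages inserted with to_active=True are never seen by
      reclaim(), which loosely mirrors the complaint that unused tmpfs
      pages placed on the active list escape scanning.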
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xtensa: includecheck fix: vectors.S · 1f0a7388
      Jaswinder Singh Rajput authored
      fix the following 'make includecheck' warnings:
      
        arch/xtensa/kernel/vectors.S: asm/processor.h is included more than once.
        arch/xtensa/kernel/vectors.S: asm/ptrace.h is included more than once.
      Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xtensa: convert to asm-generic/hardirq.h · e520c410
      Christoph Hellwig authored
      Also remove lots of unused irq_cpustat fields.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xtensa: set ARCH_KMALLOC_MINALIGN · 498900fc
      FUJITA Tomonori authored
      Architectures that handle DMA-non-coherent memory need to set
      ARCH_KMALLOC_MINALIGN to make sure that a kmalloc'ed buffer is DMA-safe:
      the buffer must not share a cache line with other objects.
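
      The alignment requirement can be checked with a little arithmetic.
      The sketch below is only an illustration under stated assumptions:
      the 32-byte cache line, the simple bump-style placement and all
      function names are hypothetical, and it does not model the real slab
      allocator.  It shows that when every buffer starts on a cache-line
      boundary, no two buffers touch the same line.

```python
def align_up(value, alignment):
    """Round `value` up to a multiple of `alignment` (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def place_buffers(sizes, minalign):
    """Place buffers back to back, each start aligned to `minalign`."""
    offsets = []
    cursor = 0
    for size in sizes:
        start = align_up(cursor, minalign)
        offsets.append(start)
        cursor = start + size
    return offsets


def lines_touched(start, size, line_size):
    """Indices of the cache lines covered by [start, start + size)."""
    first = start // line_size
    last = (start + size - 1) // line_size
    return set(range(first, last + 1))


def share_cache_line(sizes, minalign, line_size):
    """True if any two placed buffers overlap in at least one cache line."""
    seen = set()
    for start, size in zip(place_buffers(sizes, minalign), sizes):
        lines = lines_touched(start, size, line_size)
        if lines & seen:
            return True
        seen |= lines
    return False


if __name__ == "__main__":
    LINE = 32  # assumed cache line size, for illustration only
    print(share_cache_line([24, 24], minalign=8, line_size=LINE))
    print(share_cache_line([24, 24], minalign=LINE, line_size=LINE))
```

      With the weaker 8-byte alignment, the second 24-byte buffer starts at
      offset 24 and straddles the first buffer's line; with the alignment
      raised to the line size, each buffer gets lines of its own.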
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 24 May, 2010 23 commits
  3. 23 May, 2010 2 commits
  4. 22 May, 2010 10 commits