1. 15 Mar, 2011 1 commit
  2. 14 Mar, 2011 2 commits
  3. 11 Mar, 2011 3 commits
    • GFS2: introduce AIL lock · d6a079e8
      Dave Chinner authored
      The log lock is currently used to protect the AIL lists and
      the movements of buffers into and out of them. The lists
      are self contained and no log specific items outside the
      lists are accessed when starting or emptying the AIL lists.
      
      Hence the operation of the AIL does not require the protection
      of the log lock so split them out into a new AIL specific lock
      to reduce the amount of traffic on the log lock. This will
      also reduce the amount of serialisation that occurs when
      the gfs2_logd pushes on the AIL to move it forward.
      
      This reduces the impact of log pushing on sequential write
      throughput.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: fix block allocation check for fallocate · e4a7b7b0
      Benjamin Marzinski authored
      GFS2 fallocate wasn't properly checking whether blocks were already allocated.
      In write_empty_blocks(), if a page didn't have buffer_heads attached, GFS2
      was always treating it as if there were no blocks allocated for that page.
      GFS2 now calls gfs2_block_map() to check if the blocks are allocated before
      writing them out.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Optimize glock multiple-dequeue code · fa1bbdea
      Bob Peterson authored
      This is a small patch that optimizes multiple glock dequeue
      operations.  It changes the unlock order to be more efficient
      and makes it easier for lock debugging tools to unravel.  It
      also eliminates the need for the temp variable x, although
      that would likely be optimized out.
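
      A minimal userspace sketch of one way to release several held locks without a separate index variable. The message does not say which order the patch uses; releasing in reverse order of acquisition is an assumption here, and struct holder, release_one() and release_all() are hypothetical names, not GFS2 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock holder; not the GFS2 structure. */
struct holder {
    int held;         /* 1 while the lock is held */
    int release_seq;  /* position in the release order, for illustration */
};

static int next_seq;  /* hypothetical counter recording the release order */

static void release_one(struct holder *h)
{
    assert(h->held);
    h->held = 0;
    h->release_seq = next_seq++;
}

/* Release every holder, last acquired first, reusing n as the index. */
static void release_all(struct holder *hs, size_t n)
{
    while (n--)
        release_one(&hs[n]);
}
```

      Counting n down in the loop condition removes the need for a temporary, and the last lock taken is the first one dropped.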
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  4. 09 Mar, 2011 3 commits
    • GFS2: Remove potential race in flock code · 0a33443b
      Steven Whitehouse authored
      This patch ensures that we always wait for glock demotion when
      dropping flocks on a file in order to prevent any race
      conditions associated with further flock calls or closing
      the file.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Fix glock deallocation race · fc0e38da
      Steven Whitehouse authored
      This patch fixes a race in deallocating glocks which was introduced
      in the RCU glock patch. We need to ensure that the glock count is
      kept correct even in the case that there is a race to add a new
      glock into the hash table. Also, to avoid having to wait for an
      RCU grace period, the glock counter can be decremented before
      call_rcu() is called.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: quota allows exceeding hard limit · 662e3a55
      Abhijith Das authored
      Immediately after being synced to disk, cached quotas are zeroed out and a
      subsequent access of the cached quotas results in incorrect zero values. This
      meant that gfs2 assumed the actual usage to be the zero (or near-zero) usage
      values it found in the cached quotas and comparison against warn/limits never
      triggered a quota violation.
      
      This patch adds a new flag QDF_REFRESH that is set after a sync so that the
      cached quotas are forcefully refreshed from disk on a subsequent access on
      seeing this flag set.
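
      A minimal userspace sketch of the refresh-flag idea described above. The struct, the functions and the single disk value are hypothetical stand-ins; only the role of the flag (set at sync, checked and cleared on the next access) comes from the description of QDF_REFRESH.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached quota entry; not the GFS2 structure. */
struct cached_quota {
    int64_t usage;       /* cached usage value */
    bool needs_refresh;  /* plays the role the message gives QDF_REFRESH */
};

/* Write the cached value out, zero the cache, and mark it stale. */
static void quota_sync(struct cached_quota *q, int64_t *disk)
{
    assert(q && disk);
    *disk = q->usage;
    q->usage = 0;            /* the cache is zeroed after a sync */
    q->needs_refresh = true; /* so the next reader must not trust it */
}

/* Return the usage, reloading from disk first if the cache is stale. */
static int64_t quota_read(struct cached_quota *q, const int64_t *disk)
{
    assert(q && disk);
    if (q->needs_refresh) {
        q->usage = *disk;
        q->needs_refresh = false;
    }
    return q->usage;
}

static bool over_limit(struct cached_quota *q, const int64_t *disk,
                       int64_t limit)
{
    return quota_read(q, disk) > limit;
}
```

      Without the flag, a limit check right after a sync would compare the zeroed cache against the limit and never report a violation.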
      
      Resolves: rhbz#675944
      Signed-off-by: Abhi Das <adas@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  5. 24 Feb, 2011 1 commit
    • GFS2: deallocation performance patch · 4c16c36a
      Bob Peterson authored
      This patch is a performance improvement to GFS2's dealloc code.
      Rather than update the quota file and statfs file for every
      single block that's stripped off in unlink function do_strip,
      this patch keeps track and updates them once for every layer
      that's stripped.  This is done entirely inside the existing
      transaction, so there should be no risk of corruption.
      The other functions that deallocate blocks will be unaffected
      because they are using wrapper functions that do the same
      thing that they do today.
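
      A minimal userspace sketch of the batching idea: count the blocks freed in one layer and apply the accounting change once. struct accounting and both functions are hypothetical; the real code updates the quota and statfs files inside a transaction, which is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical totals standing in for the statfs and quota files. */
struct accounting {
    long free_blocks;  /* free-block total */
    long quota_used;   /* blocks charged to the owner */
    int updates;       /* how many times the totals were written */
};

/* One (expensive) update of both totals. */
static void apply_change(struct accounting *a, long freed)
{
    assert(a != NULL);
    a->free_blocks += freed;
    a->quota_used -= freed;
    a->updates++;
}

/* Strip one layer: tally freed blocks locally, then update once. */
static void strip_layer(struct accounting *a, const int *allocated, size_t n)
{
    long freed = 0;

    for (size_t i = 0; i < n; i++)
        if (allocated[i])
            freed++;

    if (freed)
        apply_change(a, freed);
}
```

      For a layer with three allocated blocks this performs one update instead of three, with the same final totals.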
      
      I tested this code on my roth cluster by creating 200
      files in a directory, each of which is 100MB, then on
      four nodes, I simultaneously deleted the files, thus competing
      for GFS2 resources (but different files).  The commands
      I used were:
      
      [root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      
      The performance increase was significant:
      
                   roth-01     roth-02     roth-03     roth-05
                   ---------   ---------   ---------   ---------
      old: real    0m34.027s   0m25.021s   0m23.906s   0m35.646s
      new: real    0m22.379s   0m24.362s   0m24.133s   0m18.562s
      
      Total time spent deleting:
      old: 118.6s
      new:  89.4s
      
      For this particular case, this showed a 25% performance increase for
      GFS2 unlinks.
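
      The totals and the percentage can be re-derived from the per-node times in the table. A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Sum a row of wall-clock times, in seconds. */
static double total_seconds(const double *t, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += t[i];
    return sum;
}

/* Percentage by which new_total is smaller than old_total. */
static double percent_saved(double old_total, double new_total)
{
    assert(old_total > 0.0);
    return 100.0 * (old_total - new_total) / old_total;
}
```

      With the four "old" and four "new" times above, the sums are 118.6s and about 89.4s, a reduction of about 24.6%, which the message rounds to 25%.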
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  6. 07 Feb, 2011 1 commit
  7. 02 Feb, 2011 1 commit
    • GFS2: Improve cluster mmap scalability · b9c93bb7
      Steven Whitehouse authored
      The mmap system call grabs a glock when an update to atime may be
      required. It does this in order to ensure that the flags on the
      inode are up to date, but since it will only mark atime for a future
      update, an exclusive lock is not required here (one will be taken
      later when the actual update is performed).
      
      Also, the lock can be skipped when the mount is marked noatime in
      addition to the original check which only looked at the noatime
      flag for the inode itself.
      
      This should increase the scalability of the mmap call when multiple
      nodes are all mmapping the same file.
      Reported-by: Scooter Morris <scooter@cgl.ucsf.edu>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  8. 31 Jan, 2011 1 commit
  9. 21 Jan, 2011 27 commits
    • GFS2: Post-VFS scale update for RCU path walk · 75d5cfbe
      Steven Whitehouse authored
      We can allow a few more cases to use RCU path walking than
      originally allowed. It should be possible to also enable
      RCU path walking when the glock is already cached. That's
      a bit more complicated though, so it is left for a future patch.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Nick Piggin <npiggin@gmail.com>
    • GFS2: Use RCU for glock hash table · bc015cb8
      Steven Whitehouse authored
      This has a number of advantages:
      
       - Reduces contention on the hash table lock
       - Makes the code smaller and simpler
       - Should speed up glock dumps when under load
       - Removes ref count changing in examine_bucket
       - No longer need hash chain lock in glock_put() in common case
      
      There are some further changes which this enables and which
      we may do in the future. One is to look at using SLAB_RCU,
      and another is to look at using a per-cpu counter for the
      per-sb glock counter, since that is touched twice in the
      lifetime of each glock (but only used at umount time).
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • Merge branch 'core-fixes-for-linus' of... · 2b1caf6e
      Linus Torvalds authored
      Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled status to init/main.c
        lockdep: Move early boot local IRQ enable/disable status to init/main.c
    • ACPI / PM: Call suspend_nvs_free() earlier during resume · d551d81d
      Rafael J. Wysocki authored
      It turns out that some device drivers map pages from the ACPI NVS region
      during resume using ioremap(), which conflicts with ioremap_cache() used
      for mapping those pages by the NVS save/restore code in nvs.c.
      
      Make the NVS pages mapped by the code in nvs.c be unmapped before device
      drivers' resume routines run.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ACPI: Introduce acpi_os_ioremap() · 2d6d9fd3
      Rafael J. Wysocki authored
      Commit ca9b600b ("ACPI / PM: Make suspend_nvs_save() use
      acpi_os_map_memory()") attempted to prevent the code in osl.c and nvs.c
      from using different ioremap() variants by making the latter use
      acpi_os_map_memory() for mapping the NVS pages.  However, that also
      requires acpi_os_unmap_memory() to be used for unmapping them, which
      causes synchronize_rcu() to be executed many times in a row
      unnecessarily and introduces substantial delays during resume on some
      systems.
      
      Instead of using acpi_os_map_memory() for mapping the NVS pages in nvs.c
      introduce acpi_os_ioremap() calling ioremap_cache() and make the code in
      both osl.c and nvs.c use it.
      Reported-by: Jeff Chua <jeff.chua.linux@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'akpm' · 8d99641f
      Linus Torvalds authored
      * akpm:
        kernel/smp.c: consolidate writes in smp_call_function_interrupt()
        kernel/smp.c: fix smp_call_function_many() SMP race
        memcg: correctly order reading PCG_USED and pc->mem_cgroup
        backlight: fix 88pm860x_bl macro collision
        drivers/leds/ledtrig-gpio.c: make output match input, tighten input checking
        MAINTAINERS: update Atmel AT91 entry
        mm: fix truncate_setsize() comment
        memcg: fix rmdir, force_empty with THP
        memcg: fix LRU accounting with THP
        memcg: fix USED bit handling at uncharge in THP
        memcg: modify accounting function for supporting THP better
        fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio
        mm: compaction: prevent division-by-zero during user-requested compaction
        mm/vmscan.c: remove duplicate include of compaction.h
        memblock: fix memblock_is_region_memory()
        thp: keep highpte mapped until it is no longer needed
        kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT
    • kernel/smp.c: consolidate writes in smp_call_function_interrupt() · 225c8e01
      Milton Miller authored
      We have to test the cpu mask in the interrupt handler before checking the
      refs, otherwise we can start to follow an entry before it's deleted and
      find it partially initialized for the next trip.  Presently we also clear
      the cpumask bit before executing the called function, which implies
      getting write access to the line.  After the function is called we then
      decrement refs, and if they go to zero we then unlock the structure.
      
      However, this implies getting write access to the call function data
      before and after the function is called.  If we can assert that no
      smp_call_function execution function is allowed to enable interrupts, then
      we can move both writes to after the function is called, hopefully allowing
      both writes with one cache line bounce.
      
      On a 256 thread system with a kernel compiled for 1024 threads, the time
      to execute the testcase in the "smp_call_function_many race" changelog was
      reduced by about 30-40ms out of about 545 ms.
      
      I decided to keep this as WARN because it's now a buggy function, even
      though the stack trace is of no value -- a simple printk would give us the
      information needed.
      
      Raw data:
      
      Without patch:
        ipi_test startup took 1219366ns complete 539819014ns total 541038380ns
        ipi_test startup took 1695754ns complete 543439872ns total 545135626ns
        ipi_test startup took 7513568ns complete 539606362ns total 547119930ns
        ipi_test startup took 13304064ns complete 533898562ns total 547202626ns
        ipi_test startup took 8668192ns complete 544264074ns total 552932266ns
        ipi_test startup took 4977626ns complete 548862684ns total 553840310ns
        ipi_test startup took 2144486ns complete 541292318ns total 543436804ns
        ipi_test startup took 21245824ns complete 530280180ns total 551526004ns
      
      With patch:
        ipi_test startup took 5961748ns complete 500859628ns total 506821376ns
        ipi_test startup took 8975996ns complete 495098924ns total 504074920ns
        ipi_test startup took 19797750ns complete 492204740ns total 512002490ns
        ipi_test startup took 14824796ns complete 487495878ns total 502320674ns
        ipi_test startup took 11514882ns complete 494439372ns total 505954254ns
        ipi_test startup took 8288084ns complete 502570774ns total 510858858ns
        ipi_test startup took 6789954ns complete 493388112ns total 500178066ns
      
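      	#include <linux/module.h>
      	#include <linux/init.h>
      	#include <linux/smp.h>       /* smp_call_function() */
      	#include <linux/workqueue.h> /* INIT_WORK(), schedule_work_on() */
      	#include <linux/sched.h> /* sched clock */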
      
      	#define ITERATIONS 100
      
      	static void do_nothing_ipi(void *dummy)
      	{
      	}
      
      	static void do_ipis(struct work_struct *dummy)
      	{
      		int i;
      
      		for (i = 0; i < ITERATIONS; i++)
      			smp_call_function(do_nothing_ipi, NULL, 1);
      
      		printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      	}
      
      	static struct work_struct work[NR_CPUS];
      
      	static int __init testcase_init(void)
      	{
      		int cpu;
      		u64 start, started, done;
      
      		start = local_clock();
      		for_each_online_cpu(cpu) {
      			INIT_WORK(&work[cpu], do_ipis);
      			schedule_work_on(cpu, &work[cpu]);
      		}
      		started = local_clock();
      		for_each_online_cpu(cpu)
      			flush_work(&work[cpu]);
      		done = local_clock();
      		pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n",
      			started-start, done-started, done-start);
      
      		return 0;
      	}
      
      	static void __exit testcase_exit(void)
      	{
      	}
      
      	module_init(testcase_init)
      	module_exit(testcase_exit)
      	MODULE_LICENSE("GPL");
      	MODULE_AUTHOR("Anton Blanchard");
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/smp.c: fix smp_call_function_many() SMP race · 6dc19899
      Anton Blanchard authored
      I noticed a failure where we hit the following WARN_ON in
      generic_smp_call_function_interrupt:
      
                      if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                              continue;
      
                      data->csd.func(data->csd.info);
      
                      refs = atomic_dec_return(&data->refs);
                      WARN_ON(refs < 0);      <-------------------------
      
      We atomically tested and cleared our bit in the cpumask, and yet the
      number of cpus left (ie refs) was 0.  How can this be?
      
      It turns out commit 54fdade1
      ("generic-ipi: make struct call_function_data lockless") is at fault.  It
      removes locking from smp_call_function_many and in doing so creates a
      rather complicated race.
      
      The problem comes about because:
      
       - The smp_call_function_many interrupt handler walks call_function.queue
         without any locking.
       - We reuse a percpu data structure in smp_call_function_many.
       - We do not wait for any RCU grace period before starting the next
         smp_call_function_many.
      
      Imagine a scenario where CPU A does two smp_call_functions back to back,
      and CPU B does an smp_call_function in between.  We concentrate on how CPU
      C handles the calls:
      
      CPU A            CPU B                  CPU C                         CPU D

      smp_call_function
                                              smp_call_function_interrupt
                                                  walks
                                              call_function.queue sees
                                              data from CPU A on list

                       smp_call_function

                                              smp_call_function_interrupt
                                                  walks
                                              call_function.queue sees
                                                (stale) CPU A on list
                                                                            smp_call_function int
                                                                            clears last ref on A
                                                                            list_del_rcu, unlock
      smp_call_function reuses
      percpu *data A
                                              data->cpumask sees and
                                              clears bit in cpumask
                                              might be using old or new fn!
                                              decrements refs below 0

      set data->refs (too late!)
      
      The important thing to note is since the interrupt handler walks a
      potentially stale call_function.queue without any locking, then another
      cpu can view the percpu *data structure at any time, even when the owner
      is in the process of initialising it.
      
      The following test case hits the WARN_ON 100% of the time on my PowerPC
      box (having 128 threads does help :)
      
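      #include <linux/module.h>
      #include <linux/init.h>
      #include <linux/smp.h>       /* smp_call_function() */
      #include <linux/workqueue.h> /* INIT_WORK(), schedule_work_on() */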
      
      #define ITERATIONS 100
      
      static void do_nothing_ipi(void *dummy)
      {
      }
      
      static void do_ipis(struct work_struct *dummy)
      {
      	int i;
      
      	for (i = 0; i < ITERATIONS; i++)
      		smp_call_function(do_nothing_ipi, NULL, 1);
      
      	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      }
      
      static struct work_struct work[NR_CPUS];
      
      static int __init testcase_init(void)
      {
      	int cpu;
      
      	for_each_online_cpu(cpu) {
      		INIT_WORK(&work[cpu], do_ipis);
      		schedule_work_on(cpu, &work[cpu]);
      	}
      
      	return 0;
      }
      
      static void __exit testcase_exit(void)
      {
      }
      
      module_init(testcase_init)
      module_exit(testcase_exit)
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("Anton Blanchard");
      
      I tried to fix it by ordering the read and the write of ->cpumask and
      ->refs.  In doing so I missed a critical case but Paul McKenney was able
      to spot my bug thankfully :) To ensure we aren't viewing previous
      iterations the interrupt handler needs to read ->refs then ->cpumask then
      ->refs _again_.
      
      Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
      
      [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ]
      [miltonm@bga.com: remove excess tests]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: <stable@kernel.org> [2.6.32+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: correctly order reading PCG_USED and pc->mem_cgroup · 713735b4
      Johannes Weiner authored
      The placement of the read-side barrier is confused: the writer first
      sets pc->mem_cgroup, then PCG_USED.  The read-side barrier has to be
      between testing PCG_USED and reading pc->mem_cgroup.
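
      A userspace analogue of the ordering described above, using C11 acquire/release atomics in place of the kernel's barrier primitives (that substitution is an assumption of this sketch). struct page_info, publish() and lookup() are hypothetical names; owner stands in for pc->mem_cgroup and used for the PCG_USED bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-page record; not the kernel's struct page_cgroup. */
struct page_info {
    void *owner;      /* stands in for pc->mem_cgroup */
    atomic_int used;  /* stands in for the PCG_USED bit */
};

/* Writer: store the pointer first, then publish it by setting the flag. */
static void publish(struct page_info *p, void *owner)
{
    assert(owner != NULL);
    p->owner = owner;
    atomic_store_explicit(&p->used, 1, memory_order_release);
}

/*
 * Reader: test the flag first, then read the pointer. The acquire load
 * sits between the two steps, pairing with the writer's release store.
 */
static void *lookup(struct page_info *p)
{
    if (!atomic_load_explicit(&p->used, memory_order_acquire))
        return NULL;
    return p->owner;
}
```

      A reader that sees the flag set is then guaranteed to see the pointer stored before it; ordering the read side anywhere else gives no such guarantee.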
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • backlight: fix 88pm860x_bl macro collision · 2550326a
      Randy Dunlap authored
      Fix collision with kernel-supplied #define:
      
        drivers/video/backlight/88pm860x_bl.c:24:1: warning: "CURRENT_MASK" redefined
        arch/x86/include/asm/page_64_types.h:6:1: warning: this is the location of the previous definition
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Haojian Zhuang <haojian.zhuang@marvell.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/leds/ledtrig-gpio.c: make output match input, tighten input checking · cc587ece
      Janusz Krzysztofik authored
      Replicate changes made to drivers/leds/ledtrig-backlight.c.
      
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Richard Purdie <richard.purdie@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: update Atmel AT91 entry · c1fc8675
      Nicolas Ferre authored
      Add two co-maintainers and update the entry with new information.
      Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
      Acked-by: Andrew Victor <linux@maxim.org.za>
      Acked-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix truncate_setsize() comment · 382e27da
      Jan Kara authored
      Contrary to what the comment says, truncate_setsize() should be called
      *before* the filesystem truncates blocks.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix rmdir, force_empty with THP · 987eba66
      KAMEZAWA Hiroyuki authored
      With THP enabled, memcg's rmdir() function is broken because
      move_account() does not support THP pages.

      This causes an accounting leak or an -EBUSY failure at rmdir().
      This patch fixes the issue by making move_account() support THP pages.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix LRU accounting with THP · ece35ca8
      KAMEZAWA Hiroyuki authored
      The memory cgroup's LRU stat should take the size of pages into account
      because Transparent Hugepage inserts hugepages into the LRU.  If this
      value is wrong, memory reclaim will not work well.

      Note: only the head page of a THP huge page is linked into the LRU.
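
      A minimal userspace sketch of size-aware LRU accounting: each entry linked into the list is counted as 2^order base pages rather than as one. struct lru_stat, lru_add() and lru_del() are hypothetical names, and order 9 (512 base pages) is used only as an example huge page size.

```c
#include <assert.h>

/* Hypothetical LRU counter, in units of base pages. */
struct lru_stat {
    unsigned long nr_pages;
};

/* Account one LRU entry that covers 2^order base pages. */
static void lru_add(struct lru_stat *s, unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long));
    s->nr_pages += 1UL << order;
}

/* Remove the accounting for one LRU entry of the same size. */
static void lru_del(struct lru_stat *s, unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long));
    assert(s->nr_pages >= (1UL << order));
    s->nr_pages -= 1UL << order;
}
```

      Adding one base page and one order-9 huge page gives a count of 513 pages, although only two entries are on the list.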
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix USED bit handling at uncharge in THP · ca3e0214
      KAMEZAWA Hiroyuki authored
      Currently, under THP:

      at charge:
        - The PageCgroupUsed bit is set on every page_cgroup of a hugepage,
          i.e. on 512 pages.
      at uncharge:
        - The PageCgroupUsed bit is cleared only on the head page.

      So some pages are left with the "Used" bit set.

      This patch fixes this by setting the Used bit only on the head page.
      Used bits for tail pages will be set at splitting if necessary.
      
      This patch adds this lock order:
         compound_lock() -> page_cgroup_move_lock().
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: modify accounting function for supporting THP better · e401f176
      KAMEZAWA Hiroyuki authored
      mem_cgroup_charge_statistics() was designed for charging a single page, but
      now we have transparent hugepages.  To fix problems (in a following patch),
      the function needs to take the number of pages as an argument.

      The new function takes the following arguments:
        - the type of page rather than 'pc'
        - the size of the page being accounted.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio · 20d9600c
      David Dillow authored
      When using devices that support max_segments > BIO_MAX_PAGES (256), direct
      IO tries to allocate a bio with more pages than allowed, which leads to an
      oops in dio_bio_alloc().  Clamp the request to the supported maximum, and
      change dio_bio_alloc() to reflect that bio_alloc() will always return a
      bio when called with __GFP_WAIT and a valid number of vectors.
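
      A minimal userspace sketch of the clamping step. MAX_PAGES_PER_BIO mirrors the value the message gives for BIO_MAX_PAGES (256); clamp_nr_pages() is a hypothetical helper, not the function in fs/direct-io.c.

```c
#include <assert.h>

/* The value the message gives for BIO_MAX_PAGES. */
#define MAX_PAGES_PER_BIO 256u

/* Limit a requested page count to what one bio may hold. */
static unsigned int clamp_nr_pages(unsigned int wanted)
{
    assert(wanted > 0);
    return wanted < MAX_PAGES_PER_BIO ? wanted : MAX_PAGES_PER_BIO;
}
```

      A request for 1024 pages is reduced to 256, while a request for 16 pages passes through unchanged.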
      
      [akpm@linux-foundation.org: remove redundant BUG_ON()]
      Signed-off-by: David Dillow <dillowda@ornl.gov>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: prevent division-by-zero during user-requested compaction · 82478fb7
      Johannes Weiner authored
      Up until 3e7d3449 ("mm: vmscan: reclaim order-0 and use compaction instead
      of lumpy reclaim"), compaction skipped calculating the fragmentation index
      of a zone when compaction was explicitly requested through the procfs
      knob.
      
      However, when compaction_suitable was introduced, it did not come with an
      extra check for order == -1, set on explicit compaction requests, and
      passed this order on to the fragmentation index calculation, where it
      overshifts the number of requested pages, leading to a division by zero.
      
      This patch makes sure that order == -1 is recognized as the flag it is
      rather than passing it along as valid order parameter.
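
      A minimal userspace sketch of treating order == -1 as a flag before any arithmetic is done with it. EXPLICIT_REQUEST and both functions are hypothetical, and the "does one block fit" test is only a stand-in for the fragmentation index calculation, which is not reproduced here.

```c
#include <assert.h>

/* Hypothetical name for the order == -1 flag value. */
#define EXPLICIT_REQUEST (-1)

/* Number of base pages in one block of the given order. */
static unsigned long requested_pages(int order)
{
    /* Shifting by a negative or oversized count is undefined in C. */
    assert(order >= 0 && order < (int)(8 * sizeof(unsigned long)));
    return 1UL << order;
}

/*
 * Returns 1 if at least one block of the requested size could be formed
 * from free_pages, or if the caller asked for compaction explicitly.
 */
static int enough_free_pages(unsigned long free_pages, int order)
{
    if (order == EXPLICIT_REQUEST)
        return 1;  /* recognised as a flag, never used as a shift count */
    return free_pages / requested_pages(order) >= 1;
}
```

      Checking the flag first means the shift and the division only ever see a valid order.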
      
      [akpm@linux-foundation.org: add comment, per Mel]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Jesper Juhl's avatar
    • memblock: fix memblock_is_region_memory() · abb65272
      Tomi Valkeinen authored
      memblock_is_region_memory() uses reserved memblocks to search for the
      given region, while it should use the memory memblocks.
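
      A minimal userspace sketch of the distinction: the same containment search gives different answers depending on which region list it is run against. struct region, struct region_list and list_contains() are hypothetical and do not mirror the memblock data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical region and region list. */
struct region {
    unsigned long base;
    unsigned long size;
};

struct region_list {
    const struct region *regions;
    size_t count;
};

/* True if [base, base + size) lies entirely inside one region of the list. */
static bool list_contains(const struct region_list *l,
                          unsigned long base, unsigned long size)
{
    assert(l != NULL);
    for (size_t i = 0; i < l->count; i++) {
        const struct region *r = &l->regions[i];

        if (base >= r->base && base + size <= r->base + r->size)
            return true;
    }
    return false;
}

/* "Is this range memory?" must be answered from the memory list. */
static bool is_region_memory(const struct region_list *memory,
                             unsigned long base, unsigned long size)
{
    return list_contains(memory, base, size);
}
```

      A range that lies in RAM but has not been reserved is found in the memory list and not in the reserved list, so searching the wrong list rejects a valid range.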
      
      I encountered the problem with OMAP's framebuffer ram allocation.
      Normally the ram is allocated dynamically, and this function is not
      called.  However, if we want to pass the framebuffer from the bootloader
      to the kernel (to retain the boot image), this function is used to check
      the validity of the kernel parameters for the framebuffer ram area.
      Signed-off-by: Tomi Valkeinen <tomi.valkeinen@nokia.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: keep highpte mapped until it is no longer needed · 453c7192
      Johannes Weiner authored
      Two users reported THP-related crashes on 32-bit x86 machines.  Their oops
      reports indicated an invalid pte, and subsequent code inspection showed
      that the highpte is actually used after unmap.
      
      The fix is to unmap the pte only after all operations against it are
      finished.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Ilya Dryomov <idryomov@gmail.com>
      Reported-by: werner <w.landgraf@ru.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Ilya Dryomov <idryomov@gmail.com>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      453c7192
    • David Rientjes's avatar
      kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT · 6a108a14
      David Rientjes authored
      The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
      is used to configure any non-standard kernel with a much larger scope than
      only small devices.
      
      This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
      references to the option throughout the kernel.  A new CONFIG_EMBEDDED
      option is added that automatically selects CONFIG_EXPERT when enabled and
      can be used in the future to isolate options that should only be
      considered for embedded systems (RISC architectures, SLOB, etc).
      
      Calling the option "EXPERT" more accurately represents its intention: only
      expert users who understand the impact of the configuration changes they
      are making should enable it.
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: David Woodhouse <david.woodhouse@intel.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Greg KH <gregkh@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Robin Holt <holt@sgi.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a108a14
    • Linus Torvalds's avatar
      Merge branch 'tty-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6 · fc887b15
      Linus Torvalds authored
      * 'tty-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6:
        tty: update MAINTAINERS file due to driver movement
        tty: move drivers/serial/ to drivers/tty/serial/
        tty: move hvc drivers to drivers/tty/hvc/
      fc887b15
    • Linus Torvalds's avatar
      Merge branch 'sched-fixes-for-linus' of... · 466c1906
      Linus Torvalds authored
      Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        sched, cgroup: Use exit hook to avoid use-after-free crash
        sched: Fix signed unsigned comparison in check_preempt_tick()
        sched: Replace rq->bkl_count with rq->rq_sched_info.bkl_count
        sched, autogroup: Fix CONFIG_RT_GROUP_SCHED sched_setscheduler() failure
        sched: Display autogroup names in /proc/sched_debug
        sched: Reinstate group names in /proc/sched_debug
        sched: Update effective_load() to use global share weights
      466c1906
    • Linus Torvalds's avatar
      Merge branch 'xen/xenbus' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen · 67290f41
      Linus Torvalds authored
      * 'xen/xenbus' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
        xenbus: Fix memory leak on release
        xenbus: avoid zero returns from read()
        xenbus: add missing wakeup in concurrent read/write
        xenbus: allow any xenbus command over /proc/xen/xenbus
        xenfs/xenbus: report partial reads/writes correctly
      67290f41
    • Linus Torvalds's avatar
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 · 5cdec1fc
      Linus Torvalds authored
      * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
        cifs: mangle existing header for SMB_COM_NT_CANCEL
        cifs: remove code for setting timeouts on requests
        [CIFS] cifs: reconnect unresponsive servers
        cifs: set up recurring workqueue job to do SMB echo requests
        cifs: add ability to send an echo request
        cifs: add cifs_call_async
        cifs: allow for different handling of received response
        cifs: clean up sync_mid_result
        cifs: don't reconnect server when we don't get a response
        cifs: wait indefinitely for responses
        cifs: Use mask of ACEs for SID Everyone to calculate all three permissions user, group, and other
        cifs: Fix regression during share-level security mounts (Repost)
        [CIFS] Update cifs version number
        cifs: move mid result processing into common function
        cifs: move locked sections out of DeleteMidQEntry and AllocMidQEntry
        cifs: clean up accesses to midCount
        cifs: make wait_for_free_request take a TCP_Server_Info pointer
        cifs: no need to mark smb_ses_list as cifs_demultiplex_thread is exiting
        cifs: don't fail writepages on -EAGAIN errors
        CIFS: Fix oplock break handling (try #2)
      5cdec1fc