1. 22 Mar, 2012 32 commits
    • pagemap: introduce data structure for pagemap entry · 092b50ba
      Naoya Horiguchi authored
      Currently the local variable holding a pagemap entry in pagemap_pte_range()
      is named pfn and typed u64, which is not correct (a pfn should be an
      unsigned long).
      
      This patch introduces a dedicated type for pagemap entries and converts
      the code to use it.
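
      A minimal user-space sketch of the idea, assuming the pagemap bit layout
      documented at the time (bit 63 = page present, bit 62 = swapped, bits
      0-54 = PFN).  The names pagemap_entry_t, make_pme(), PM_* and pme_pfn()
      are illustrative here: wrapping the raw 64-bit value in a one-member
      struct lets the compiler reject accidental mixing of an entry with a
      bare PFN.

```c
#include <stdbool.h>
#include <stdint.h>

/* One pagemap entry. A struct, not a bare integer, so it cannot be
 * silently assigned to or from an unsigned long PFN. */
typedef struct {
	uint64_t pme;
} pagemap_entry_t;

/* Bit layout assumed from the pagemap documentation of this period. */
#define PM_PRESENT   (1ULL << 63)
#define PM_SWAP      (1ULL << 62)
#define PM_PFN_MASK  ((1ULL << 55) - 1)   /* bits 0-54 */

static inline pagemap_entry_t make_pme(uint64_t val)
{
	return (pagemap_entry_t){ .pme = val };
}

static inline bool pme_present(pagemap_entry_t e)
{
	return (e.pme & PM_PRESENT) != 0;
}

/* The PFN field is only meaningful for a present page; return 0 otherwise. */
static inline unsigned long pme_pfn(pagemap_entry_t e)
{
	return pme_present(e) ? (unsigned long)(e.pme & PM_PFN_MASK) : 0UL;
}
```

      A caller would build entries with make_pme(PM_PRESENT | pfn) and read
      them back with pme_pfn(), never touching the raw u64 directly.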
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagemap: document KPF_THP and make page-types aware of it · 807f0ccf
      Naoya Horiguchi authored
      page-types, a common user of pagemap, becomes aware of thp with this
      patch.  This helps system admins and kernel hackers understand how thp
      works.  Here is sample output of page-types over a thp:
      
        $ page-types -p <pid> --raw --list
      
        voffset offset  len     flags
        ...
        7f9d40200       3f8400  1       ___U_lA____Ma_bH______t____________
        7f9d40201       3f8401  1ff     ________________T_____t____________
      
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000410000             511        1  ________________T_____t____________        compound_tail,thp
        0x000000000040d868               1        0  ___U_lA____Ma_bH______t____________        uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp
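
      The flags column is the raw /proc/kpageflags word and the
      long-symbolic-flags column is its decoding.  A small decoder sketch,
      assuming the KPF_* bit numbers from include/linux/kernel-page-flags.h
      (only the bits that occur in the sample are listed, with KPF_THP as
      bit 22); the table and function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Subset of KPF_* bit numbers, in ascending order; values assumed from
 * include/linux/kernel-page-flags.h. */
static const struct { int bit; const char *name; } kpf_names[] = {
	{ 3, "uptodate" }, { 5, "lru" }, { 6, "active" },
	{ 11, "mmap" }, { 12, "anonymous" }, { 14, "swapbacked" },
	{ 15, "compound_head" }, { 16, "compound_tail" }, { 22, "thp" },
};

/* Write a comma-separated list of the known flag names set in `flags`
 * into buf (always NUL-terminated, truncated if buf is too small).
 * Bits not in the table are ignored. */
static void kpf_decode(uint64_t flags, char *buf, size_t len)
{
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(kpf_names) / sizeof(kpf_names[0]); i++) {
		if (!(flags & (1ULL << kpf_names[i].bit)))
			continue;
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, kpf_names[i].name, len - strlen(buf) - 1);
	}
}
```

      For instance, 0x410000 has bits 16 and 22 set and decodes to
      compound_tail,thp, matching the first summary row.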
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagemap: export KPF_THP · e873c49f
      Naoya Horiguchi authored
      This flag shows that a given page is a subpage of a transparent hugepage.
      It helps us debug and test the kernel by showing the physical address of a thp.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: optimize away unnecessary page table locking · 025c5b24
      Naoya Horiguchi authored
      Currently, when we check whether we can handle a thp as is or need to
      split it into regular-sized pages, we take the page table lock before
      checking whether a given pmd maps a thp.  Because of this, when the pmd
      is not a "huge pmd", we suffer unnecessary lock/unlock overhead.  To
      remove it, this patch introduces an optimized check function and
      replaces several similar pieces of logic with it.
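
      The general pattern is a cheap unlocked test that filters out the
      common non-huge case, followed by a re-check under the lock, because
      the pmd can change between the two.  A user-space model of that
      pattern, with a pthread mutex standing in for the page table lock and a
      hypothetical is_huge flag standing in for the pmd state; it is not the
      kernel helper, and it ignores the kernel's handling of a pmd that is in
      the middle of being split:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pmd plus its page table lock. */
struct fake_pmd {
	pthread_mutex_t ptl;
	atomic_bool is_huge;   /* models "this pmd maps a thp" */
};

/*
 * Returns true with pmd->ptl held if the pmd maps a huge page; the caller
 * must unlock. Returns false without holding the lock otherwise, so the
 * common "regular pmd" case pays no lock/unlock cost.
 */
static bool huge_pmd_lock(struct fake_pmd *pmd)
{
	if (!atomic_load(&pmd->is_huge))
		return false;                 /* fast path: no locking at all */

	pthread_mutex_lock(&pmd->ptl);
	if (atomic_load(&pmd->is_huge))
		return true;                  /* still huge: keep the lock */
	pthread_mutex_unlock(&pmd->ptl);  /* changed under us: back off */
	return false;
}
```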
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagemap: avoid splitting thp when reading /proc/pid/pagemap · 5aaabe83
      Naoya Horiguchi authored
      A thp split is not necessary if we explicitly check whether pmds map
      thps.  This patch introduces this check and adds code to generate
      pagemap entries for pmds mapping thps, which reduces the performance
      impact of reading pagemap on thp.
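
      When a pmd maps a thp, the 4 KiB subpages in its range are backed by
      consecutive PFNs starting at the huge page's head, so one entry per
      subpage can be synthesized from the pmd alone.  A sketch of that
      arithmetic, assuming x86-64 sizes (4 KiB pages, 2 MiB huge pages) and
      the pagemap "present" bit; the function name and parameters are
      hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define HPAGE_SHIFT  21                       /* assumed 2 MiB thp */
#define HPAGE_SIZE   (1UL << HPAGE_SHIFT)
#define HPAGE_MASK   (~(HPAGE_SIZE - 1))
#define PM_PRESENT   (1ULL << 63)

/*
 * Fill out[] with pagemap-style entries for [addr, end), which must lie
 * inside one huge page whose first PFN is head_pfn. No split is needed:
 * each subpage's PFN is head_pfn plus its page offset within the thp.
 * Returns the number of entries written.
 */
static size_t thp_pagemap_entries(unsigned long addr, unsigned long end,
				  uint64_t head_pfn, uint64_t *out)
{
	size_t n = 0;

	for (; addr < end; addr += 1UL << PAGE_SHIFT) {
		uint64_t idx = (addr & ~HPAGE_MASK) >> PAGE_SHIFT;
		out[n++] = PM_PRESENT | (head_pfn + idx);
	}
	return n;
}
```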
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: search from free_area_cache for the bigger size · b716ad95
      Xiao Guangrong authored
      If the required size is bigger than cached_hole_size, it is better to
      search from free_area_cache: it is easier to find a free region there,
      especially for 64-bit processes, whose address space is large enough.
      
      Do it the same way as hugetlb_get_unmapped_area_topdown() in
      arch/x86/mm/hugetlbpage.c.
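
      The heuristic can be modeled as follows: cached_hole_size records the
      largest hole known to lie below free_area_cache, so a request larger
      than that cannot fit below the cache and the search may start at
      free_area_cache directly.  A simplified model, assuming a bottom-up
      search; the struct and function names are hypothetical, not the
      kernel's mm_struct:

```c
/* Hypothetical subset of the per-mm search hints. */
struct area_hints {
	unsigned long free_area_cache;   /* where the last search ended */
	unsigned long cached_hole_size;  /* largest hole seen below it */
};

/*
 * Pick the address at which a bottom-up search for len bytes starts.
 * A request larger than every known hole below free_area_cache cannot
 * fit there, so skip straight to free_area_cache. Otherwise restart from
 * base and reset the hole size, which the new scan will recompute.
 */
static unsigned long search_start(struct area_hints *h, unsigned long len,
				  unsigned long base)
{
	if (len > h->cached_hole_size)
		return h->free_area_cache;
	h->cached_hole_size = 0;
	return base;
}
```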
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not reset cached_hole_size when vma is unmapped · f44d2198
      Xiao Guangrong authored
      In the current code, cached_hole_size is set to the maximum value if the
      unmapped vma lies below free_area_cache, so the next search starts from
      the base address.
      
      Instead, we can keep cached_hole_size, so that if the next required size
      is larger than cached_hole_size, the search can start from free_area_cache.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: try to search again if it is really needed · cbde83e2
      Xiao Guangrong authored
      Search again only if some holes may have been skipped in the first pass.
      
      [akpm@linux-foundation.org: clean up crazy compound definition]
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: fix hugetlb_get_unmapped_area() · 4bfc130d
      Xiao Guangrong authored
      Use and update cached_hole_size and free_area_cache properly to speed up
      finding a free region.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: make compact_control order signed · aad6ec37
      Dan Carpenter authored
      "order" is -1 when compacting via /proc/sys/vm/compact_memory.  Making
      it unsigned causes a bug in __compact_pgdat() when we test:
      
      	if (cc->order < 0 || !compaction_deferred(zone, cc->order))
      		compact_zone(zone, cc);
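
      The underlying C pitfall: storing -1 in an unsigned field wraps to the
      type's maximum value, and the test "order < 0" is then always false, so
      the whole-memory case falls through to compaction_deferred().  A
      stand-alone demonstration with hypothetical struct names:

```c
#include <stdbool.h>

struct cc_unsigned { unsigned int order; };  /* the buggy layout */
struct cc_signed   { int order; };           /* the fixed layout */

/* Both ask: "is this a whole-memory compaction request (order == -1)?" */
static bool whole_memory_unsigned(const struct cc_unsigned *cc)
{
	/* Always false: an unsigned value is never negative
	 * (gcc's -Wtype-limits warns about exactly this). */
	return cc->order < 0;
}

static bool whole_memory_signed(const struct cc_signed *cc)
{
	return cc->order < 0;
}
```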
      
      [akpm@linux-foundation.org: make __compact_pgdat()'s comparison match other code sites]
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compact_pgdat: workaround lockdep warning in kswapd · 8575ec29
      Hugh Dickins authored
      I get this lockdep warning from swapping load on linux-next, due to
      "vmscan: kswapd carefully call compaction".
      
      =================================
      [ INFO: inconsistent lock state ]
      3.3.0-rc2-next-20120201 #5 Not tainted
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
      {RECLAIM_FS-ON-W} state was registered at:
        [<ffffffff81099b75>] mark_held_locks+0xd7/0x103
        [<ffffffff8109a13c>] lockdep_trace_alloc+0x85/0x9e
        [<ffffffff810f6bdc>] __kmalloc+0x6c/0x14b
        [<ffffffff810d57fd>] pcpu_mem_zalloc+0x59/0x62
        [<ffffffff810d5d16>] pcpu_extend_area_map+0x26/0xb1
        [<ffffffff810d679f>] pcpu_alloc+0x182/0x325
        [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
        [<ffffffff8142ebfd>] snmp_mib_init+0x1e/0x2e
        [<ffffffff8185cd8d>] ipv4_mib_init_net+0x7a/0x184
        [<ffffffff813dc963>] ops_init.clone.0+0x6b/0x73
        [<ffffffff813dc9cc>] register_pernet_operations+0x61/0xa0
        [<ffffffff813dca8e>] register_pernet_subsys+0x29/0x42
        [<ffffffff8185d044>] inet_init+0x1ad/0x252
        [<ffffffff810002e3>] do_one_initcall+0x7a/0x12f
        [<ffffffff81832bc5>] kernel_init+0x9d/0x11e
        [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
      irq event stamp: 656613
      hardirqs last  enabled at (656613): [<ffffffff814e0ddc>] __mutex_unlock_slowpath+0x104/0x128
      hardirqs last disabled at (656612): [<ffffffff814e0d34>] __mutex_unlock_slowpath+0x5c/0x128
      softirqs last  enabled at (655568): [<ffffffff8105b4a5>] __do_softirq+0x120/0x136
      softirqs last disabled at (654757): [<ffffffff814e52dc>] call_softirq+0x1c/0x30
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(pcpu_alloc_mutex);
        <Interrupt>
          lock(pcpu_alloc_mutex);
      
       *** DEADLOCK ***
      
      no locks held by kswapd0/28.
      
      stack backtrace:
      Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
      Call Trace:
       [<ffffffff810981f4>] print_usage_bug+0x1bf/0x1d0
       [<ffffffff81096c3e>] ? print_irq_inversion_bug+0x1d9/0x1d9
       [<ffffffff810982c0>] mark_lock_irq+0xbb/0x22e
       [<ffffffff810c5399>] ? free_hot_cold_page+0x13d/0x14f
       [<ffffffff81098684>] mark_lock+0x251/0x331
       [<ffffffff81098893>] mark_irqflags+0x12f/0x141
       [<ffffffff81098e32>] __lock_acquire+0x58d/0x753
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff81099433>] lock_acquire+0x54/0x6a
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff8107a5b8>] ? add_preempt_count+0xa9/0xae
       [<ffffffff814e0a21>] mutex_lock_nested+0x5e/0x315
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff81098f81>] ? __lock_acquire+0x6dc/0x753
       [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
       [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
       [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
       [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
       [<ffffffff8106c35e>] schedule_on_each_cpu+0x23/0x110
       [<ffffffff810c9fcb>] lru_add_drain_all+0x10/0x12
       [<ffffffff810f126f>] __compact_pgdat+0x20/0x182
       [<ffffffff810f15c2>] compact_pgdat+0x27/0x29
       [<ffffffff810c306b>] ? zone_watermark_ok+0x1a/0x1c
       [<ffffffff810cdf6f>] balance_pgdat+0x732/0x751
       [<ffffffff810ce0ed>] kswapd+0x15f/0x178
       [<ffffffff810cdf8e>] ? balance_pgdat+0x751/0x751
       [<ffffffff8106fd11>] kthread+0x84/0x8c
       [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
       [<ffffffff810787ed>] ? finish_task_switch+0x85/0xea
       [<ffffffff814e3861>] ? retint_restore_args+0xe/0xe
       [<ffffffff8106fc8d>] ? __init_kthread_worker+0x56/0x56
       [<ffffffff814e51e0>] ? gs_change+0xb/0xb
      
      The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that
      Nick hacked into lockdep a while back: I think we're intended to read that
      "<Interrupt>" in the DEADLOCK scenario as "<Direct reclaim>".
      
      I'm hazy: I have not reached any conclusion as to whether it's right to
      complain or not, but I believe it's uneasy about kswapd now doing the
      mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails.  Nor have
      I reached any conclusion as to whether it's important for kswapd to do
      that draining or not.
      
      But so as not to get blocked on this, with lockdep disabled from giving
      further reports, here's a patch which removes the lru_add_drain_all() from
      kswapd's callpath (and calls it only once from compact_nodes(), instead of
      once per node).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: only defer compaction for failed order and higher · aff62249
      Rik van Riel authored
      Currently a failed order-9 (transparent hugepage) compaction can lead to
      memory compaction being temporarily disabled for a memory zone, even if
      we only need compaction for an order-2 allocation, e.g. for jumbo-frame
      networking.
      
      The fix is relatively straightforward: keep track of the highest order at
      which compaction is succeeding, and only defer compaction for orders at
      which compaction is failing.
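
      A simplified model of that bookkeeping, assuming a single per-zone field
      holding the lowest order at which compaction has recently failed.  The
      field and function names are hypothetical, and the exponential back-off
      counter that the real deferral logic also keeps is left out: requests
      below the failed order are never deferred, and a success at or above it
      raises the threshold.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11   /* assumed: orders 0..10 */

struct zone_compact {
	int order_failed;    /* lowest order that recently failed */
};

static void zone_compact_init(struct zone_compact *z)
{
	z->order_failed = MODEL_MAX_ORDER;   /* nothing has failed yet */
}

/* Only requests at or above the failed order are deferred. */
static bool compaction_should_defer(const struct zone_compact *z, int order)
{
	return order >= z->order_failed;
}

static void compaction_failed(struct zone_compact *z, int order)
{
	if (order < z->order_failed)
		z->order_failed = order;
}

/* A success at this order means it no longer needs to be deferred. */
static void compaction_succeeded(struct zone_compact *z, int order)
{
	if (order >= z->order_failed)
		z->order_failed = order + 1;
}
```

      With this model, a failed order-9 (thp) attempt defers further order-9
      attempts but leaves an order-2 jumbo-frame request untouched.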
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: kswapd carefully call compaction · 7be62de9
      Rik van Riel authored
      With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
      free pages, even when it is woken for a higher order request.
      
      This could be bad for, e.g., jumbo-frame network allocations, which are
      done from interrupt context and cannot compact memory themselves.  Higher
      allocation failure rates than before have been observed in the network
      receive path in kernels with compaction enabled.
      
      Teach kswapd to defragment the memory zones in a node, but only if
      required and compaction is not deferred in a zone.
      
      [akpm@linux-foundation.org: reduce scope of zones_need_compaction]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: reclaim at order 0 when compaction is enabled · fe2c2a10
      Rik van Riel authored
      When built with CONFIG_COMPACTION, kswapd should not try to free
      contiguous pages, because it is not trying hard enough to have a real
      chance at being successful, but still disrupts the LRU enough to break
      other things.
      
      Do not do higher order page isolation unless we really are in lumpy
      reclaim mode.
      
      Stop reclaiming pages once we have enough free pages that compaction can
      deal with things, and we hit the normal order 0 watermarks used by kswapd.
      
      Also remove a line of code that increments balanced right before exiting
      the function.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make swapin readahead skip over holes · 67f96aa2
      Rik van Riel authored
      Ever since abandoning the virtual scan of processes, for scalability
      reasons, swap space has been a little more fragmented than before.  This
      can lead to the situation where a large memory user is killed, swap space
      ends up full of "holes" and swapin readahead is totally ineffective.
      
      On my home system, after killing a leaky firefox it took over an hour to
      page just under 2GB of memory back in, slowing the virtual machines down
      to a crawl.
      
      This patch makes swapin readahead simply skip over holes, instead of
      stopping at them.  This allows the system to swap things back in at rates
      of several MB/second, instead of a few hundred kB/second.
      
      The checks done in valid_swaphandles are already done in
      read_swap_cache_async as well, allowing us to remove a fair amount of
      code.
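
      A sketch of the window logic, assuming the readahead window is the
      naturally aligned block of 1 << page_cluster slots containing the
      faulting offset, that slot 0 holds the swap header, and that a per-slot
      read callback reports a hole by returning false.  The callback type and
      function names are hypothetical; the point is the "continue" on a hole
      where stopping would end the readahead early.

```c
#include <stdbool.h>

/* Hypothetical: start an async read of one swap slot; false for a hole. */
typedef bool (*read_slot_fn)(unsigned long offset, void *ctx);

/*
 * Issue readahead for the aligned window of (1 << page_cluster) slots
 * containing offset, skipping holes instead of stopping at them.
 * page_cluster must be smaller than the width of unsigned long.
 * Returns the number of slots successfully queued.
 */
static unsigned int swapin_readahead_window(unsigned long offset,
					    unsigned int page_cluster,
					    read_slot_fn read_slot, void *ctx)
{
	/* Shift an unsigned long, not an int, so large page_cluster
	 * values do not overflow. */
	unsigned long mask = (1UL << page_cluster) - 1;
	unsigned long start = offset & ~mask;
	unsigned long end = offset | mask;
	unsigned int queued = 0;

	if (start == 0)
		start = 1;              /* slot 0 is the swap header */

	for (unsigned long off = start; off <= end; off++) {
		if (!read_slot(off, ctx))
			continue;       /* hole: skip it and keep going */
		queued++;
	}
	return queued;
}

/* Example callback: ctx is an array of per-slot "in use" flags. */
static bool slot_is_used(unsigned long offset, void *ctx)
{
	const bool *used = ctx;
	return used[offset];
}
```

      With page_cluster = 3 and a fault at slot 10, the window is slots
      8..15; every used slot in it is read even if slot 9 is a hole.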
      
      [akpm@linux-foundation.org: fix it for page_cluster >= 32]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Adrian Drzewiecki <z@drze.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone() · c38446cc
      Hillf Danton authored
      The value of nr_reclaimed is the number of pages reclaimed in the current
      round of the loop, whereas nr_to_reclaim should be compared with the
      number of pages reclaimed in all rounds.
      
      In each round of the loop, reclaimed pages are cut off from the reclaim
      goal, and the loop stops once the goal is achieved.
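
      A sketch of the corrected accounting, with a hypothetical reclaim_round()
      callback standing in for one pass over the LRU lists: each round's
      yield is subtracted from the remaining goal, so the loop ends when the
      cumulative total, not a single round, reaches nr_to_reclaim.

```c
/* Hypothetical: reclaim one batch and return the pages freed this round. */
typedef unsigned long (*reclaim_round_fn)(void *ctx);

/*
 * Run rounds until the pages reclaimed across all rounds meet the goal,
 * or a round makes no progress. Returns the total reclaimed.
 */
static unsigned long reclaim_until(unsigned long nr_to_reclaim,
				   reclaim_round_fn reclaim_round, void *ctx)
{
	unsigned long total = 0;

	while (nr_to_reclaim > 0) {
		unsigned long nr_reclaimed = reclaim_round(ctx);  /* this round only */

		if (nr_reclaimed == 0)
			break;                  /* no progress: give up */
		total += nr_reclaimed;
		/* Cut this round's pages off the remaining goal. */
		nr_to_reclaim -= nr_reclaimed < nr_to_reclaim ?
				 nr_reclaimed : nr_to_reclaim;
	}
	return total;
}

/* Example callback: ctx points to a fixed per-round yield. */
static unsigned long fixed_yield(void *ctx)
{
	return *(const unsigned long *)ctx;
}
```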
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make get_mm_counter static-inline · 69c97823
      Konstantin Khlebnikov authored
      Make get_mm_counter() always static inline; it is simple enough for that.
      Also remove the unused set_mm_counter().
      
      bloat-o-meter:
      
      add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242)
      function                                     old     new   delta
      try_to_unmap_one                             886     952     +66
      sys_remap_file_pages                        1214    1230     +16
      dup_mm                                      1684    1700     +16
      do_exit                                     2277    2278      +1
      zap_page_range                               208     205      -3
      unmap_region                                 304     296      -8
      static.oom_kill_process                      554     546      -8
      try_to_unmap_file                           1716    1700     -16
      getrusage                                    925     909     -16
      flush_old_exec                              1704    1688     -16
      static.dump_header                           416     390     -26
      acct_update_integrals                        218     187     -31
      do_task_stat                                2986    2954     -32
      get_mm_counter                                34       -     -34
      xacct_add_tsk                                371     334     -37
      task_statm                                   172     118     -54
      task_mem                                     383     323     -60
      
      try_to_unmap_one() grows because update_hiwater_rss() is now completely inlined.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: cleanup with s/reclaim_mode/isolate_mode/ · 61317289
      Hillf Danton authored
      With many uses of reclaim_mode (a field of struct scan_control) already
      in the file, it is clearer to rename the local reclaim_mode variable used
      when setting up the isolation mode.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add rss counters consistency check · c3f0327f
      Konstantin Khlebnikov authored
      Warn about non-zero rss counters at final mmdrop.
      
      This check will prevent recurrences of bugs such as the one fixed in "mm:
      fix rss count leakage during migration".
      
      I didn't hide this check under CONFIG_DEBUG_VM because it is rather small
      and the rss counters cover all of page-table management, so this is a
      good invariant.
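
      A user-space model of the invariant: once the last reference to an mm is
      dropped, every mapping has been torn down, so every rss counter must be
      zero.  The struct, enum and function names are hypothetical (the
      counters are modeled on the kernel's file, anon and swap-entry
      counters), and the message format is illustrative:

```c
#include <stdio.h>

/* Counter indices modeled on MM_FILEPAGES / MM_ANONPAGES / MM_SWAPENTS. */
enum { RSS_FILE, RSS_ANON, RSS_SWAP, NR_RSS_COUNTERS };

struct fake_mm {
	long rss[NR_RSS_COUNTERS];
};

/*
 * Called at final teardown of the mm. Reports every non-zero counter,
 * which indicates an accounting leak somewhere in page-table management,
 * and returns how many were found.
 */
static int check_rss_counters(const struct fake_mm *mm)
{
	int bad = 0;

	for (int i = 0; i < NR_RSS_COUNTERS; i++) {
		if (mm->rss[i] != 0) {
			fprintf(stderr, "BUG: bad rss-counter state idx:%d val:%ld\n",
				i, mm->rss[i]);
			bad++;
		}
	}
	return bad;
}
```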
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: introduce independent oom killer ratelimit state · dc3f21ea
      David Rientjes authored
      printk_ratelimit() uses the global ratelimit state for all printks.  The
      oom killer should not be subjected to this state just because another
      subsystem or driver may be flooding the kernel log.
      
      This patch introduces printk ratelimiting specifically for the oom killer.
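
      The kernel offers DEFINE_RATELIMIT_STATE() and __ratelimit() in
      include/linux/ratelimit.h for this kind of per-user state.  The sketch
      below is only a user-space model of the idea, with hypothetical names:
      each subsystem owns its own state, so a flood from one user cannot use
      up another's budget.  Time is passed in explicitly to keep it
      deterministic.

```c
#include <stdbool.h>
#include <time.h>

/* At most `burst` events per `interval` seconds, per instance. */
struct rl_state {
	time_t interval;
	int burst;
	time_t begin;   /* start of the current window; 0 = not started */
	int printed;    /* events allowed in the current window */
};

/* Returns true if the caller may emit its message at time `now`. */
static bool rl_allow(struct rl_state *rs, time_t now)
{
	if (rs->begin == 0 || now - rs->begin >= rs->interval) {
		rs->begin = now;        /* open a new window */
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}
```

      A noisy driver that exhausts its own rl_state leaves a separate
      oom-killer rl_state untouched.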
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: do not emit oom killer warning if chosen thread is already exiting · 8447d950
      David Rientjes authored
      If a thread is chosen for oom kill and is already PF_EXITING, then the oom
      killer simply sets TIF_MEMDIE and returns.  This allows the thread to have
      access to memory reserves so that it may quickly exit.  This logic is
      preceded by a comment saying there's no need to alarm the sysadmin.
      This patch makes that statement true.
      
      There's no need to emit any warning about the oom condition if the thread
      is already exiting, since it will not be killed.  In this case, just
      return silently from the oom killer, since it is only granting access to
      memory reserves and is otherwise a no-op.
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: fold oom_kill_task() into oom_kill_process() · 647f2bdf
      David Rientjes authored
      oom_kill_task() has a single caller, so fold it into its parent function,
      oom_kill_process().  Slightly reduces the number of lines in the oom
      killer.
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: avoid looping when chosen thread detaches its mm · 2a1c9b1f
      David Rientjes authored
      oom_kill_task() returns non-zero iff the chosen process does not have any
      threads with an attached ->mm.
      
      In such a case, it's better to just return to the page allocator and retry
      the allocation because memory could have been freed in the interim and the
      oom condition may no longer exist.  It's unnecessary to loop in the oom
      killer and find another thread to kill.
      
      This allows both oom_kill_task() and oom_kill_process() to be converted to
      void functions.  If the oom condition persists, the oom killer will be
      called again.
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sparc: use block_sigmask() · ce24d8a1
      Matt Fleming authored
      Use the new helper function introduced in commit 5e6292c0 ("signal:
      add block_sigmask() for adding sigmask to current->blocked") which
      centralises the code for updating current->blocked after successfully
      delivering a signal and reduces the amount of duplicate code across
      architectures.  In the past some architectures got this code wrong, so
      using this helper function should stop that from happening again.
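
      Roughly, block_sigmask() ORs the handler's sa_mask into current->blocked
      and, unless SA_NODEFER is set, also blocks the signal being delivered,
      then installs the result with set_current_blocked().  A user-space model
      of just the mask computation, using a 64-bit mask where bit (sig - 1)
      stands for signal sig; the types, SIGBIT() and the MODEL_SA_NODEFER
      value are hypothetical:

```c
#include <stdint.h>

typedef uint64_t sigmask64_t;
#define SIGBIT(sig) ((sigmask64_t)1 << ((sig) - 1))

#define MODEL_SA_NODEFER 0x1u   /* hypothetical flag value for this model */

struct model_sigaction {
	sigmask64_t sa_mask;        /* extra signals blocked while in the handler */
	unsigned int sa_flags;
};

/*
 * New blocked set after a signal frame was set up successfully:
 * current blocked set | handler's sa_mask | the signal itself,
 * the last one omitted when SA_NODEFER was requested.
 */
static sigmask64_t model_block_sigmask(sigmask64_t blocked,
				       const struct model_sigaction *ka,
				       int sig)
{
	sigmask64_t next = blocked | ka->sa_mask;

	if (!(ka->sa_flags & MODEL_SA_NODEFER))
		next |= SIGBIT(sig);
	return next;
}
```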
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xtensa: use set_current_blocked() and block_sigmask() · d12f7c4a
      Matt Fleming authored
      As described in commit e6fa16ab ("signal: sigprocmask() should do
      retarget_shared_pending()") the modification of current->blocked is
      incorrect as we need to check whether the signal we're about to block is
      pending in the shared queue.
      
      Also, use the new helper function introduced in commit 5e6292c0
      ("signal: add block_sigmask() for adding sigmask to current->blocked")
      which centralises the code for updating current->blocked after
      successfully delivering a signal and reduces the amount of duplicate code
      across architectures.  In the past some architectures got this code wrong,
      so using this helper function should stop that from happening again.
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: default avatarMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d12f7c4a
    • xtensa: don't mask signals if we fail to setup signal stack · 3785006a
      Matt Fleming authored
      setup_frame() needs to return an indication of whether it succeeded or
      failed in setting up the signal stack frame.  If setup_frame() fails then
      we must not modify current->blocked.
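      The ordering rule can be sketched in userspace as follows. setup_frame_model() and handle_signal_model() are hypothetical stand-ins for the xtensa functions, and the failure is simulated with a flag; the point is only that the blocked set is touched after the frame has been set up successfully.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sigmask_model_t;   /* bit (sig - 1) => signal sig */

/* Hypothetical stand-in: 0 on success, -EFAULT if the user stack is unusable. */
static int setup_frame_model(bool stack_ok)
{
    return stack_ok ? 0 : -EFAULT;
}

static int handle_signal_model(sigmask_model_t *blocked,
                               sigmask_model_t to_block, bool stack_ok)
{
    int ret;

    assert(blocked);
    ret = setup_frame_model(stack_ok);
    if (ret)
        return ret;          /* no frame: leave the blocked set untouched */
    *blocked |= to_block;    /* frame is in place, now apply the mask */
    return 0;
}
```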
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3785006a
    • xtensa: no need to reset handler if SA_ONESHOT · ff6d21e7
      Matt Fleming authored
      get_signal_to_deliver() already resets the signal handler if SA_ONESHOT
      is set in ka->sa.sa_flags, so there's no need to do it again in
      handle_signal().
      
      Furthermore, because we were modifying ka->sa.sa_handler (which is a
      copy of sighand->action[]) instead of sighand->action[], the original
      code actually had no effect on signal delivery.
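      Why writing to the copy had no effect can be shown with a small userspace model: the delivery path hands back a by-value copy of the table entry and resets the real entry itself. The table, the struct and the MODEL_SA_ONESHOT value are hypothetical stand-ins for sighand->action[], struct k_sigaction and SA_ONESHOT.

```c
#include <assert.h>

typedef void (*handler_model_t)(int);

#define MODEL_SIG_DFL    ((handler_model_t)0)
#define MODEL_SA_ONESHOT 0x80000000u   /* hypothetical value for the model */

struct k_sigaction_model {
    handler_model_t sa_handler;
    unsigned int sa_flags;
};

/* Stands in for sighand->action[]; entry (sig - 1) belongs to signal sig. */
static struct k_sigaction_model action[64];

static void demo_handler(int sig) { (void)sig; }

/* Like get_signal_to_deliver(): return a copy, reset the real entry if one-shot. */
static struct k_sigaction_model get_signal_model(int sig)
{
    struct k_sigaction_model ka;

    assert(sig >= 1 && sig <= 64);
    ka = action[sig - 1];
    if (ka.sa_flags & MODEL_SA_ONESHOT)
        action[sig - 1].sa_handler = MODEL_SIG_DFL;
    return ka;
}
```

      Resetting ka.sa_handler afterwards, as the removed code did, changes only the local copy; the real entry has already been reset by the delivery path.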
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff6d21e7
    • xtensa: don't reimplement force_sigsegv() · fa47ac59
      Matt Fleming authored
      Instead of open-coding the sequence from force_sigsegv(), just call it.
      This also fixes a bug: we were modifying ka->sa.sa_handler (which is a
      copy of sighand->action[]), whereas the intention of the code was to
      modify sighand->action[] directly.
      
      As the original code was working with a copy it had no effect on signal
      delivery.
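      A hypothetical userspace model of the sequence being delegated to force_sigsegv(), assuming it behaves as follows: when the signal whose frame could not be built is SIGSEGV itself, the real action table entry (not a copy) is reset to SIG_DFL before SIGSEGV is forced, so the task is killed instead of re-entering a handler that cannot be set up. model_action[], forced_sig and force_sig_model() are illustrative stand-ins, and locking is omitted.

```c
#include <assert.h>
#include <signal.h>

typedef void (*handler_model_t)(int);

/* Stands in for sighand->action[], indexed directly by signal number. */
static handler_model_t model_action[65];
static int forced_sig;                 /* last signal "forced" on the task */

static void segv_handler(int sig) { (void)sig; }

static void force_sig_model(int sig) { forced_sig = sig; }

static int force_sigsegv_model(int sig)
{
    assert(sig >= 1 && sig < 65);
    if (sig == SIGSEGV)
        model_action[SIGSEGV] = SIG_DFL;   /* edit the real table entry */
    force_sig_model(SIGSEGV);
    return 0;
}
```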
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa47ac59
    • seq_file: fix mishandling of consecutive pread() invocations. · 7904ac84
      Earl Chew authored
      The following program illustrates the problem:
      
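          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>
      
          int main(void)
          {
              char buf[8192];
              ssize_t n;
              int fd = open("/proc/self/maps", O_RDONLY);
              if (fd < 0) return 1;
      
              n = pread(fd, buf, sizeof(buf), 0);
              printf("%zd\n", n);
      
              /* lseek(fd, 0, SEEK_CUR); */ /* Uncomment to work around */
      
              n = pread(fd, buf, sizeof(buf), 0);
              printf("%zd\n", n);
      
              close(fd);
              return 0;
          }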
      
      The second printf() prints zero, but uncommenting the lseek() corrects its
      behaviour.
      
      To fix, make seq_read() mirror seq_lseek() when processing changes in
      *ppos.  Restore m->version first, then if required traverse and update
      read_pos on success.
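      The idea of the fix can be sketched with a small userspace model of a record-oriented reader: the reader remembers the file offset its state corresponds to (read_pos), and when a caller such as pread() passes a different offset, it re-walks the records from the start before copying. struct seq_model, traverse_model() and seq_read_model() are hypothetical; m->version and the real seq_file buffering are not modelled.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical model: output is a list of records emitted in order. */
struct seq_model {
    const char *const *records;
    size_t nrecords;
    size_t index;        /* next record to emit */
    long long read_pos;  /* file offset that `index` corresponds to */
};

/* Re-walk records from the start up to `offset` (record boundaries only). */
static int traverse_model(struct seq_model *m, long long offset)
{
    long long pos = 0;
    size_t i = 0;

    while (i < m->nrecords &&
           pos + (long long)strlen(m->records[i]) <= offset) {
        pos += (long long)strlen(m->records[i]);
        i++;
    }
    if (pos != offset)
        return -1;
    m->index = i;
    m->read_pos = pos;
    return 0;
}

static ssize_t seq_read_model(struct seq_model *m, char *buf, size_t size,
                              long long *ppos)
{
    size_t copied = 0;

    assert(m && buf && ppos);
    /* The fix: an offset that doesn't match our state means re-traverse. */
    if (*ppos != m->read_pos && traverse_model(m, *ppos) != 0)
        return -1;
    while (m->index < m->nrecords) {
        size_t len = strlen(m->records[m->index]);

        if (copied + len > size)
            break;
        memcpy(buf + copied, m->records[m->index], len);
        copied += len;
        m->index++;
    }
    m->read_pos += (long long)copied;
    *ppos += (long long)copied;
    return (ssize_t)copied;
}
```

      Without the re-traverse check, the second call at offset 0 in this model would start from the last record index and return 0, which matches the symptom of the test program.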
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=11856
      Signed-off-by: Earl Chew <echew@ixiacom.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7904ac84
    • drivers/idle/intel_idle.c: fix confusing code identation · dc716e96
      Marcos Paulo de Souza authored
      Fix confusing code indentation in the function intel_idle_cpu_init().
      Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
      Cc: "Brown, Len" <len.brown@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc716e96
    • fs/namei.c: fix warnings on 32-bit · 1de5b41c
      Andrew Morton authored
      i386 allnoconfig:
      
        fs/namei.c: In function 'has_zero':
        fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type
        fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type
        fs/namei.c: In function 'hash_name':
        fs/namei.c:1635: warning: integer constant is too large for 'unsigned long' type
      
      There must be a tidier way of doing this.
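      One width-independent way to build such byte-repeating constants, sketched in userspace: ~0ul / 0xff has 0x01 in every byte of an unsigned long whatever its width, so no 64-bit literal has to appear in a 32-bit build. REPEAT_BYTE, ONEBYTES and HIGHBITS here are illustrative, and whether fs/namei.c's has_zero() has exactly this shape is an assumption; the zero-byte test itself is the classic word-at-a-time trick.

```c
#include <assert.h>

/* 0x01 in every byte of an unsigned long, for 32-bit and 64-bit alike. */
#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))

#define ONEBYTES REPEAT_BYTE(0x01)
#define HIGHBITS REPEAT_BYTE(0x80)

/* Nonzero iff at least one byte of `a` is zero. */
static unsigned long has_zero(unsigned long a)
{
    return (a - ONEBYTES) & ~a & HIGHBITS;
}
```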
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1de5b41c
    • mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Andrea Arcangeli authored
      In some cases pmd_none_or_clear_bad() is called with the mmap_sem held
      in read mode.  In those cases huge page faults can allocate hugepmds
      under pmd_none_or_clear_bad(), and that can trigger a false positive
      from pmd_bad(), which does not expect to see a pmd materializing as
      trans huge.
      
      khugepaged is not causing the problem: it holds the mmap_sem in write
      mode (and all those sites must hold the mmap_sem in read mode to
      prevent pagetables from going away under them; during code review it
      seemed vm86 mode on 32-bit kernels requires that too, unless it is
      restricted to one thread per process or to UP builds).  The race is
      only with the huge page faults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively, all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with respect to the page
      faults, and the result is always undefined when they run
      simultaneously.  This is probably why it was uncommon to run into
      this.  For example, if madvise(MADV_DONTNEED) runs zap_page_range()
      shortly before the page fault, the hugepage will not be zapped; if the
      page fault runs first, it will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be
      enough to fix this, because zap_pmd_range() would then proceed to call
      zap_pte_range() (which would be incorrect if the pmd became a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd onto the local stack
      (whatever value we read, no actual CPU barriers are needed, only a
      compiler barrier) and make sure it cannot change under the code that
      tests it.  Even if the real pmd changes under the value we hold on the
      stack, we don't care.  If we actually end up in zap_pte_range(), the
      pmd was already not none and not huge, and it can't become huge under
      us (see the khugepaged locking explained above).
      
      All we need is to guarantee that, in a code path like the one below,
      pmd_trans_huge() can no longer be false while pmd_none_or_clear_bad()
      runs into a hugepmd.  The overhead of a barrier() is just a compiler
      constraint and should not be measurable (I only added it for THP
      builds).  I don't rule out that some compiler versions may already
      have prevented the race by caching the value of *pmd on the stack
      (that hasn't been verified, but it is plausible, since
      pmd_none_or_clear_bad(), pmd_bad(), pmd_trans_huge() and pmd_none() are
      all inlines and no external function is called between
      pmd_trans_huge() and pmd_none_or_clear_bad()).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
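      The snapshot-then-test pattern described above can be modelled in userspace as follows. The bit layout and helper bodies are hypothetical (they are not any architecture's real pmd encoding), and barrier() is the GCC/Clang empty asm with a memory clobber, a compiler-only barrier. Because the clobber tells the compiler that memory may have changed, it must keep using the value already loaded into pmdval instead of re-reading *pmd between the individual tests, so an entry seen as huge once is treated as huge throughout and is never cleared as bad.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pmd encoding for this model only. */
typedef uint64_t pmd_model_t;
#define PMD_PRESENT 0x1u
#define PMD_TABLE   0x2u   /* points to a pte page: the normal case */
#define PMD_HUGE    0x4u   /* maps a huge page directly (trans huge) */

/* Compiler-only barrier (GCC/Clang syntax); emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int pmd_none(pmd_model_t v)       { return v == 0; }
static int pmd_trans_huge(pmd_model_t v) { return (v & PMD_HUGE) != 0; }
static int pmd_bad(pmd_model_t v)
{
    return (v & (PMD_PRESENT | PMD_TABLE)) != (PMD_PRESENT | PMD_TABLE);
}
static void pmd_clear_bad(pmd_model_t *pmd) { *pmd = 0; }

/* Returns 1 if the caller must skip this pmd, 0 if it points to a pte page. */
static int pmd_none_or_trans_huge_or_clear_bad_model(pmd_model_t *pmd)
{
    pmd_model_t pmdval;

    assert(pmd);
    pmdval = *pmd;               /* single snapshot of the entry */
    barrier();                   /* keep using pmdval, don't re-read *pmd */
    if (pmd_none(pmdval))
        return 1;
    if (pmd_bad(pmdval)) {
        if (!pmd_trans_huge(pmdval))
            pmd_clear_bad(pmd);  /* genuinely corrupt: clear it */
        return 1;                /* huge: leave it alone, just skip */
    }
    return 0;
}
```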
      
      Because this race condition could be exercised without special
      privileges, it was reported as CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  2. 21 Mar, 2012 8 commits
    • Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging · 31f67652
      Linus Torvalds authored
      Pull hwmon changes for v3.4 from Guenter Roeck:
       "Mostly cleanup.  No new drivers this time around, but support for
        several chips added to existing drivers: TPS40400, TPS40422, MDT040,
        MAX34446, ZL9101M, ZL9117M, and LM96080.  Also, added watchdog support
        for SCH56xx, and additional attributes for a couple of drivers."
      
      * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (137 commits)
        hwmon: (sch56xx) Add support for the integrated watchdog (v2)
        hwmon: (w83627ehf) Add support for temperature offset registers
        hwmon: (jc42) Remove unnecessary device IDs
        hwmon: (zl6100) Add support for ZL9101M and ZL9117M
        hwmon: (adm1275) Add support for ADM1075
        hwmon: (max34440) Add support for MAX34446
        hwmon: (pmbus) Add more virtual registers
        hwmon: (pmbus) Add support for Lineage Power MDT040
        hwmon: (pmbus) Add support for TI TPS40400 and TPS40422
        hwmon: (max34440) Add support for 'lowest' output voltage attribute
        hwmon: (jc42) Convert to use devm_kzalloc
        hwmon: (max16065) Convert to use devm_kzalloc
        hwmon: (smm665) Convert to use devm_kzalloc
        hwmon: (ltc4261) Convert to use devm_kzalloc
        hwmon: (pmbus) Simplify remove functions
        hwmon: (pmbus) Convert pmbus drivers to use devm_kzalloc
        hwmon: (lineage-pem) Convert to use devm_kzalloc
        hwmon: (hwmon-vid) Fix checkpatch issues
        hwmon: (hwmon-vid) Add new entries to VRM model table
        hwmon: (lm80) Add detection of NatSemi/TI LM96080
        ...
      31f67652
    • Merge tag 'regulator-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator · d15d7644
      Linus Torvalds authored
      Pull regulator updates for 3.4 from Mark Brown:
       "This has been a fairly quiet release from a regulator point of view,
        the only real framework features added were devm support and a
        convenience helper for setting up fixed voltage regulators.
      
        We also added a couple of drivers (but will drop the BQ24022 driver
        via the arm-soc tree as it's been replaced by the more generic
        gpio-regulator driver) and Axel Lin continued his relentless and
        generally awesome stream of fixes and cleanups."
      
      * tag 'regulator-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (93 commits)
        regulator: Fix up a confusing dev_warn when DT lookup fails
        regulator: Convert tps6507x to set_voltage_sel
        regulator: Refactor tps6507x to use one tps6507x_pmic_ops for all LDOs and DCDCs
        regulator: Make s5m8767_get_voltage_register always return correct register
        regulator: s5m8767: Check pdata->buck[2|3|4]_gpiodvs earlier
        regulator: tps65910: Provide settling time for DCDC voltage change
        regulator: Add Anatop regulator driver
        regulator: Simplify implementation of tps65912_get_voltage_dcdc
        regulator: Use tps65912_set_voltage_sel for both DCDCx and LDOx
        regulator: tps65910: Provide settling time for enabling rails
        regulator: max8925: Use DIV_ROUND_UP macro
        regulator: tps65912: Use simple equations to get register address
        regulator: Fix the logic of tps65910_get_mode
        regulator: Merge tps65217_pmic_ldo234_ops and tps65217_pmic_dcdc_ops to tps65217_pmic_ops
        regulator: Use DIV_ROUND_CLOSEST in wm8350_isink_get_current
        regulator: Use array to store dcdc_range settings for tps65912
        regulator: Rename s5m8767_convert_voltage to s5m8767_convert_voltage_to_sel
        regulator: tps6524x: Remove unneeded comment for N_REGULATORS
        regulator: Rename set_voltage_sel callback function name to *_sel
        regulator: Fix s5m8767_set_voltage_time_sel calculation value
        ...
      d15d7644
    • Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband · 0c2fe82a
      Linus Torvalds authored
      Pull InfiniBand/RDMA changes for the 3.4 merge window from Roland Dreier:
       "Nothing big really stands out; by patch count lots of fixes to the
        mlx4 driver plus some cleanups and fixes to the core and other
        drivers."
      
      * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (28 commits)
        mlx4_core: Scale size of MTT table with system RAM
        mlx4_core: Allow dynamic MTU configuration for IB ports
        IB/mlx4: Fix info returned when querying IBoE ports
        IB/mlx4: Fix possible missed completion event
        mlx4_core: Report thermal error events
        mlx4_core: Fix one more static exported function
        IB: Change CQE "csum_ok" field to a bit flag
        RDMA/iwcm: Reject connect requests if cmid is not in LISTEN state
        RDMA/cxgb3: Don't pass irq flags to flush_qp()
        mlx4_core: Get rid of redundant ext_port_cap flags
        RDMA/ucma: Fix AB-BA deadlock
        IB/ehca: Fix ilog2() compile failure
        IB: Use central enum for speed instead of hard-coded values
        IB/iser: Post initial receive buffers before sending the final login request
        IB/iser: Free IB connection resources in the proper place
        IB/srp: Consolidate repetitive sysfs code
        IB/srp: Use pr_fmt() and pr_err()/pr_warn()
        IB/core: Fix SDR rates in sysfs
        mlx4: Enforce device max FMR maps in FMR alloc
        IB/mlx4: Set bad_wr for invalid send opcode
        ...
      0c2fe82a
    • Merge tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6 · 5f0e685f
      Linus Torvalds authored
      Pull SPI changes for v3.4 from Grant Likely:
       "Mostly a bunch of new drivers and driver bug fixes; but this also
        includes a few patches that create a core message queue infrastructure
        for the spi subsystem instead of making each driver open code it."
      
      * tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6: (34 commits)
        spi/fsl-espi: Make sure pm is within 2..32
        spi/fsl-espi: make the clock computation easier to read
        spi: sh-hspi: modify write/read method
        spi: sh-hspi: control spi clock more correctly
        spi: sh-hspi: convert to using core message queue
        spi: s3c64xx: Fix build
        spi: s3c64xx: remove unnecessary callback msg->complete
        spi: remove redundant variable assignment
        spi: release lock on error path in spi_pump_messages()
        spi: Compatibility with direction which is used in samsung DMA operation
        spi-topcliff-pch: add recovery processing in case wait-event timeout
        spi-topcliff-pch: supports a spi mode setup and bit order setup by IO control
        spi-topcliff-pch: Fix issue for transmitting over 4KByte
        spi-topcliff-pch: Modify pci-bus number dynamically to get DMA device info
        spi/imx: simplify error handling to free gpios
        spi: Convert to DEFINE_PCI_DEVICE_TABLE
        spi: add Broadcom BCM63xx SPI controller driver
        SPI: add CSR SiRFprimaII SPI controller driver
        spi-topcliff-pch: fix -Wuninitialized warning
        spi: Mark spi_register_board_info() __devinit
        ...
      5f0e685f
    • Merge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux-2.6 · f8974cb7
      Linus Torvalds authored
      Pull core device tree changes for Linux v3.4 from Grant Likely:
       "This branch contains a minor documentation addition, a utility
        function for parsing string properties needed by some of the new ARM
        platforms, disables dynamic DT code that isn't used anywhere but on a
        few PPC machines, and exports DT node compatible data to userspace via
        UEVENT properties.  Nothing earth-shattering here."
      
      * tag 'dt-for-linus' of git://git.secretlab.ca/git/linux-2.6:
        of: Only compile OF_DYNAMIC on PowerPC pseries and iseries
        arm/dts: OMAP3: Add omap3evm and am335xevm support
        drivercore: Output common devicetree information in uevent
        of: Add of_property_match_string() to find index into a string list
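      A string-list property in a device tree is a run of NUL-terminated strings packed back to back (a compatible list, for instance). A minimal userspace sketch of finding the index of one entry, the job the shortlog attributes to of_property_match_string(); match_string_in_list() is hypothetical and simply returns -1 where the kernel helper returns its own error codes.

```c
#include <assert.h>
#include <string.h>

/* Return the index of the first entry equal to `s` in a packed list of
 * NUL-terminated strings of total length `len`, or -1 if it is absent or
 * the list is not properly NUL-terminated. */
static int match_string_in_list(const char *prop, size_t len, const char *s)
{
    size_t off = 0;
    int index = 0;

    assert(prop && s);
    while (off < len) {
        const char *entry = prop + off;
        const char *nul = memchr(entry, '\0', len - off);

        if (!nul)
            return -1;                    /* malformed: unterminated entry */
        if (strcmp(entry, s) == 0)
            return index;
        off += (size_t)(nul - entry) + 1; /* skip the entry and its NUL */
        index++;
    }
    return -1;
}
```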
      f8974cb7
    • Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6 · c207f3a4
      Linus Torvalds authored
      Pull irq_domain support for all architectures from Grant Likely:
       "Generalize powerpc's irq_host as irq_domain
      
        This branch takes the PowerPC irq_host infrastructure (reverse mapping
        from Linux IRQ numbers to hardware irq numbering), generalizes it,
        renames it to irq_domain, and makes it available to all architectures.
      
        Originally the plan had been to create an all-new irq_domain
        implementation which addresses some of the powerpc shortcomings such
        as not handling 1:1 mappings well, but doing that proved to be far
        more difficult and invasive than generalizing the working code and
        refactoring it in-place.  So, this branch rips out the 'new'
        irq_domain and replaces it with the modified powerpc version (in a
        fully bisectable way of course).  It converts all users over to the
        new API and makes irq_domain selectable on any architecture.
      
        No architecture is forced to enable irq_domain, but the infrastructure
        is required for doing OpenFirmware style irq translations.  It will
        even work on SPARC, even though SPARC has its own mechanism for
        translating irqs at boot time.  MIPS, microblaze, embedded x86 and c6x
        are converted too.
      
        The resulting irq_domain code is probably still too verbose and can be
        optimized more, but that can be done incrementally and is a task for
        follow-on patches."
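      The core idea summarised above (one domain per interrupt controller, translating between the controller's local hwirq numbers and global Linux irq numbers) can be sketched with a hypothetical linear map. struct irq_domain_model, create_mapping() and find_mapping() are illustrative stand-ins, not the irq_domain API, and the global irq allocator is just a counter.

```c
#include <assert.h>

#define NR_HWIRQS    32   /* hypothetical controller size */
#define NR_VIRQS     256  /* hypothetical size of the global irq space */
#define NO_IRQ_MODEL 0    /* 0 means "not mapped" in this model */

/* One domain per interrupt controller: hwirq -> Linux irq (virq). */
struct irq_domain_model {
    unsigned int revmap[NR_HWIRQS];
};

static unsigned int next_virq = 1;              /* trivial global allocator */
static unsigned int virq_to_hwirq[NR_VIRQS];    /* virq -> hwirq (domain not recorded) */

static unsigned int create_mapping(struct irq_domain_model *d, unsigned int hwirq)
{
    unsigned int virq;

    assert(hwirq < NR_HWIRQS);
    if (d->revmap[hwirq] != NO_IRQ_MODEL)
        return d->revmap[hwirq];                /* already mapped: reuse */
    virq = next_virq++;
    assert(virq < NR_VIRQS);
    d->revmap[hwirq] = virq;
    virq_to_hwirq[virq] = hwirq;
    return virq;
}

static unsigned int find_mapping(const struct irq_domain_model *d,
                                 unsigned int hwirq)
{
    return hwirq < NR_HWIRQS ? d->revmap[hwirq] : NO_IRQ_MODEL;
}
```

      Two controllers can both have a hwirq 5; in this model the per-domain table is what keeps their Linux irq numbers distinct.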
      
      * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: (31 commits)
        dt: fix twl4030 for non-dt compile on x86
        mfd: twl-core: Add IRQ_DOMAIN dependency
        devicetree: Add empty of_platform_populate() for !CONFIG_OF_ADDRESS (sparc)
        irq_domain: Centralize definition of irq_dispose_mapping()
        irq_domain/mips: Allow irq_domain on MIPS
        irq_domain/x86: Convert x86 (embedded) to use common irq_domain
        ppc-6xx: fix build failure in flipper-pic.c and hlwd-pic.c
        irq_domain/microblaze: Convert microblaze to use irq_domains
        irq_domain/powerpc: Replace custom xlate functions with library functions
        irq_domain/powerpc: constify irq_domain_ops
        irq_domain/c6x: Use library of xlate functions
        irq_domain/c6x: constify irq_domain structures
        irq_domain/c6x: Convert c6x to use generic irq_domain support.
        irq_domain: constify irq_domain_ops
        irq_domain: Create common xlate functions that device drivers can use
        irq_domain: Remove irq_domain_add_simple()
        irq_domain: Remove 'new' irq_domain in favour of the ppc one
        mfd: twl-core.c: Fix the number of interrupts managed by twl4030
        of/address: add empty static inlines for !CONFIG_OF
        irq_domain: Add support for base irq and hwirq in legacy mappings
        ...
      c207f3a4
    • Merge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c7c66c0c
      Linus Torvalds authored
      Pull power management updates for 3.4 from Rafael Wysocki:
       "Assorted extensions and fixes including:
      
        * Introduction of early/late suspend/hibernation device callbacks.
        * Generic PM domains extensions and fixes.
        * devfreq updates from Axel Lin and MyungJoo Ham.
        * Device PM QoS updates.
        * Fixes of concurrency problems with wakeup sources.
        * System suspend and hibernation fixes."
      
      * tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
        PM / Domains: Check domain status during hibernation restore of devices
        PM / devfreq: add relation of recommended frequency.
        PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
        PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
        PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
        PM / Domains: Introduce "always on" device flag
        PM / Domains: Fix hibernation restore of devices, v2
        PM / Domains: Fix handling of wakeup devices during system resume
        sh_mmcif / PM: Use PM QoS latency constraint
        tmio_mmc / PM: Use PM QoS latency constraint
        PM / QoS: Make it possible to expose PM QoS latency constraints
        PM / Sleep: JBD and JBD2 missing set_freezable()
        PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
        PM / Freezer: Remove references to TIF_FREEZE in comments
        PM / Sleep: Add more wakeup source initialization routines
        PM / Hibernate: Enable usermodehelpers in hibernate() error path
        PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
        PM / Sleep: Fix race conditions related to wakeup source timer function
        PM / Sleep: Fix possible infinite loop during wakeup source destruction
        PM / Hibernate: print physical addresses consistently with other parts of kernel
        ...
      c7c66c0c
    • Merge branch 'kmap_atomic' of git://github.com/congwang/linux · 9f393834
      Linus Torvalds authored
      Pull kmap_atomic cleanup from Cong Wang.
      
      It's been in -next for a long time, and it gets rid of the (no longer
      used) second argument to k[un]map_atomic().
      
      Fix up a few trivial conflicts in various drivers, and do an "evil
      merge" to catch some new uses that have come in since Cong's tree.
      
      * 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
        feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
        highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
        drbd: remove the second argument of k[un]map_atomic()
        zcache: remove the second argument of k[un]map_atomic()
        gma500: remove the second argument of k[un]map_atomic()
        dm: remove the second argument of k[un]map_atomic()
        tomoyo: remove the second argument of k[un]map_atomic()
        sunrpc: remove the second argument of k[un]map_atomic()
        rds: remove the second argument of k[un]map_atomic()
        net: remove the second argument of k[un]map_atomic()
        mm: remove the second argument of k[un]map_atomic()
        lib: remove the second argument of k[un]map_atomic()
        power: remove the second argument of k[un]map_atomic()
        kdb: remove the second argument of k[un]map_atomic()
        udf: remove the second argument of k[un]map_atomic()
        ubifs: remove the second argument of k[un]map_atomic()
        squashfs: remove the second argument of k[un]map_atomic()
        reiserfs: remove the second argument of k[un]map_atomic()
        ocfs2: remove the second argument of k[un]map_atomic()
        ntfs: remove the second argument of k[un]map_atomic()
        ...
      9f393834