1. 10 Jun, 2013 40 commits
    • Mel Gorman's avatar
      mempolicy: fix a race in shared_policy_replace() · aaa1bc47
      Mel Gorman authored
      commit b22d127a upstream.
      
      shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
      be dereferenced if sp->lock is not held and 2) another thread can modify
      sp_node between spin_unlock for allocating a new sp node and next
      spin_lock.  The bug was introduced before 2.6.12-rc2.
      
      Kosaki's original patch for this problem was to allocate an sp node and
      policy within shared_policy_replace and initialise it when the lock is
      reacquired.  I was not keen on this approach because it partially
      duplicates sp_alloc().  As the paths were sp->lock is taken are not that
      performance critical this patch converts sp->lock to sp->mutex so it can
      sleep when calling sp_alloc().
      
      [kosaki.motohiro@jp.fujitsu.com: Original patch]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      aaa1bc47
    • Hugh Dickins's avatar
      mm: fix invalidate_complete_page2() lock ordering · bc68657f
      Hugh Dickins authored
      commit ec4d9f62 upstream.
      
      In fuzzing with trinity, lockdep protested "possible irq lock inversion
      dependency detected" when isolate_lru_page() reenabled interrupts while
      still holding the supposedly irq-safe tree_lock:
      
      invalidate_inode_pages2
        invalidate_complete_page2
          spin_lock_irq(&mapping->tree_lock)
          clear_page_mlock
            isolate_lru_page
              spin_unlock_irq(&zone->lru_lock)
      
      isolate_lru_page() is correct to enable interrupts unconditionally:
      invalidate_complete_page2() is incorrect to call clear_page_mlock() while
      holding tree_lock, which is supposed to nest inside lru_lock.
      
      Both truncate_complete_page() and invalidate_complete_page() call
      clear_page_mlock() before taking tree_lock to remove page from radix_tree.
       I guess invalidate_complete_page2() preferred to test PageDirty (again)
      under tree_lock before committing to the munlock; but since the page has
      already been unmapped, its state is already somewhat inconsistent, and no
      worse if clear_page_mlock() moved up.
      Reported-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Deciphered-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      bc68657f
    • Takamori Yamaguchi's avatar
      mm: bugfix: set current->reclaim_state to NULL while returning from kswapd() · 61daa923
      Takamori Yamaguchi authored
      commit b0a8cc58 upstream.
      
      In kswapd(), set current->reclaim_state to NULL before returning, as
      current->reclaim_state holds reference to variable on kswapd()'s stack.
      
      In rare cases, while returning from kswapd() during memory offlining,
      __free_slab() and freepages() can access the dangling pointer of
      current->reclaim_state.
      Signed-off-by: default avatarTakamori Yamaguchi <takamori.yamaguchi@jp.sony.com>
      Signed-off-by: default avatarAaditya Kumar <aaditya.kumar@ap.sony.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      61daa923
    • Christoffer Dall's avatar
      mm: Fix PageHead when !CONFIG_PAGEFLAGS_EXTENDED · 6d9f42be
      Christoffer Dall authored
      commit ad4b3fb7 upstream.
      
      Unfortunately with !CONFIG_PAGEFLAGS_EXTENDED, (!PageHead) is false, and
      (PageHead) is true, for tail pages.  If this is indeed the intended
      behavior, which I doubt because it breaks cache cleaning on some ARM
      systems, then the nomenclature is highly problematic.
      
      This patch makes sure PageHead is only true for head pages and PageTail
      is only true for tail pages, and neither is true for non-compound pages.
      
      [ This buglet seems ancient - seems to have been introduced back in Apr
        2008 in commit 6a1e7f77: "pageflags: convert to the use of new
        macros".  And the reason nobody noticed is because the PageHead()
        tests are almost all about just sanity-checking, and only used on
        pages that are actual page heads.  The fact that the old code returned
        true for tail pages too was thus not really noticeable.   - Linus ]
      Signed-off-by: default avatarChristoffer Dall <cdall@cs.columbia.edu>
      Acked-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Will Deacon <Will.Deacon@arm.com>
      Cc: Steve Capper <Steve.Capper@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      6d9f42be
    • Dave Hansen's avatar
      mm: fix vma_resv_map() NULL pointer · 6b36635b
      Dave Hansen authored
      commit 4523e145 upstream
      
      hugetlb_reserve_pages() can be used for either normal file-backed
      hugetlbfs mappings, or MAP_HUGETLB.  In the MAP_HUGETLB, semi-anonymous
      mode, there is not a VMA around.  The new call to resv_map_put() assumed
      that there was, and resulted in a NULL pointer dereference:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
        IP: vma_resv_map+0x9/0x30
        PGD 141453067 PUD 1421e1067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP
        ...
        Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
        RIP: vma_resv_map+0x9/0x30
        ...
        Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
        Call Trace:
          resv_map_put+0xe/0x40
          hugetlb_reserve_pages+0xa6/0x1d0
          hugetlb_file_setup+0x102/0x2c0
          newseg+0x115/0x360
          ipcget+0x1ce/0x310
          sys_shmget+0x5a/0x60
          system_call_fastpath+0x16/0x1b
      
      This was reported by Dave Jones, but was reproducible with the
      libhugetlbfs test cases, so shame on me for not running them in the
      first place.
      
      With this, the oops is gone, and the output of libhugetlbfs's
      run_tests.py is identical to plain 3.4 again.
      
      [ Marked for stable, since this was introduced by commit c50ac050
        ("hugetlb: fix resv_map leak in error path") which was also marked for
        stable ]
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: backported to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      6b36635b
    • Dave Hansen's avatar
      hugetlb: fix resv_map leak in error path · bc89fc4a
      Dave Hansen authored
      commit c50ac050 upstream
      
      When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
      does a resv_map_alloc().  It depends on code in hugetlbfs's
      vm_ops->close() to release that allocation.
      
      However, in the mmap() failure path, we do a plain unmap_region() without
      the remove_vma() which actually calls vm_ops->close().
      
      This is a decent fix.  This leak could get reintroduced if new code (say,
      after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
      an error.  But, I think it would have to unroll the reservation anyway.
      
      Christoph's test case:
      
      	http://marc.info/?l=linux-mm&m=133728900729735
      
      This patch applies to 3.4 and later.  A version for earlier kernels is at
      https://lkml.org/lkml/2012/5/22/418.
      Signed-off-by: default avatarDave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: default avatarChristoph Lameter <cl@linux.com>
      Tested-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: backported to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      bc89fc4a
    • Namhyung Kim's avatar
      tracing: Fix double free when function profile init failed · 83c84290
      Namhyung Kim authored
      commit 83e03b3f upstream.
      
      On the failure path, stat->start and stat->pages will refer same page.
      So it'll attempt to free the same page again and get kernel panic.
      
      Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.orgSigned-off-by: default avatarNamhyung Kim <namhyung@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      83c84290
    • Wen Congyang's avatar
      tracing: Don't call page_to_pfn() if page is NULL · a2cefc7b
      Wen Congyang authored
      commit 85f2a2ef upstream.
      
      When allocating memory fails, page is NULL. page_to_pfn() will
      cause the kernel panicked if we don't use sparsemem vmemmap.
      
      Link: http://lkml.kernel.org/r/505AB1FF.8020104@cn.fujitsu.comAcked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      a2cefc7b
    • Li Zhong's avatar
      Fix a dead loop in async_synchronize_full() · 48bac809
      Li Zhong authored
      [Fixed upstream by commits 2955b47d and
      a4683487 from Dan Williams, but they are much
      more intrusive than this tiny fix, according to Andrew - gregkh]
      
      This patch tries to fix a dead loop in  async_synchronize_full(), which
      could be seen when preemption is disabled on a single cpu machine.
      
      void async_synchronize_full(void)
      {
              do {
                      async_synchronize_cookie(next_cookie);
              } while (!list_empty(&async_running) || !
      list_empty(&async_pending));
      }
      
      async_synchronize_cookie() calls async_synchronize_cookie_domain() with
      &async_running as the default domain to synchronize.
      
      However, there might be some works in the async_pending list from other
      domains. On a single cpu system, without preemption, there is no chance
      for the other works to finish, so async_synchronize_full() enters a dead
      loop.
      
      It seems async_synchronize_full() wants to synchronize all entries in
      all running lists(domains), so maybe we could just check the entry_count
      to know whether all works are finished.
      
      Currently, async_synchronize_cookie_domain() expects a non-NULL running
      list ( if NULL, there would be NULL pointer dereference ), so maybe a
      NULL pointer could be used as an indication for the functions to
      synchronize all works in all domains.
      Reported-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarLi Zhong <zhong@linux.vnet.ibm.com>
      Tested-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarChristian Kujau <lists@nerdbynature.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Cc: Christian Kujau <lists@nerdbynature.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      48bac809
    • Tejun Heo's avatar
      cgroup: remove incorrect dget/dput() pair in cgroup_create_dir() · 5e224edf
      Tejun Heo authored
      commit 17543163 upstream.
      
      cgroup_create_dir() does weird dancing with dentry refcnt.  On
      success, it gets and then puts it achieving nothing.  On failure, it
      puts but there isn't no matching get anywhere leading to the following
      oops if cgroup_create_file() fails for whatever reason.
      
        ------------[ cut here ]------------
        kernel BUG at /work/os/work/fs/dcache.c:552!
        invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU 2
        Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs
        RIP: 0010:[<ffffffff811d9c0c>]  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
        RSP: 0018:ffff88001a3ebef8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403
        RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58
        RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001
        R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea
        R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60
        FS:  00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080)
        Stack:
         ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000
         ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8
         ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8
        Call Trace:
         [<ffffffff811cc889>] done_path_create+0x19/0x50
         [<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80
         [<ffffffff811d2009>] sys_mkdir+0x19/0x20
         [<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b
        Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41
        RIP  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
         RSP <ffff88001a3ebef8>
        ---[ end trace 1277bcfd9561ddb0 ]---
      
      Fix it by dropping the unnecessary dget/dput() pair.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      5e224edf
    • Bjorn Helgaas's avatar
      Driver core: treat unregistered bus_types as having no devices · a21f3aa1
      Bjorn Helgaas authored
      commit 4fa3e78b upstream.
      
      A bus_type has a list of devices (klist_devices), but the list and the
      subsys_private structure that contains it are not initialized until the
      bus_type is registered with bus_register().
      
      The panic/reboot path has fixups that look up devices in pci_bus_type.  If
      we panic before registering pci_bus_type, the bus_type exists but the list
      does not, so mach_reboot_fixups() trips over a null pointer and panics
      again:
      
          mach_reboot_fixups
            pci_get_device
              ..
                bus_find_device(&pci_bus_type, ...)
                  bus->p is NULL
      
      Joonsoo reported a problem when panicking before PCI was initialized.
      I think this patch should be sufficient to replace the patch he posted
      here: https://lkml.org/lkml/2012/12/28/75 ("[PATCH] x86, reboot: skip
      reboot_fixups in early boot phase")
      Reported-by: default avatarJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      a21f3aa1
    • T Makphaibulchoke's avatar
      kernel/resource.c: fix stack overflow in __reserve_region_with_split() · 0cb1546c
      T Makphaibulchoke authored
      commit 4965f566 upstream.
      
      Using a recursive call add a non-conflicting region in
      __reserve_region_with_split() could result in a stack overflow in the case
      that the recursive calls are too deep.  Convert the recursive calls to an
      iterative loop to avoid the problem.
      
      Tested on a machine containing 135 regions.  The kernel no longer panicked
      with stack overflow.
      
      Also tested with code arbitrarily adding regions with no conflict,
      embedding two consecutive conflicts and embedding two non-consecutive
      conflicts.
      Signed-off-by: default avatarT Makphaibulchoke <tmac@hp.com>
      Reviewed-by: default avatarRam Pai <linuxram@us.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@gmail.com>
      Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      0cb1546c
    • Thadeu Lima de Souza Cascardo's avatar
      genalloc: stop crashing the system when destroying a pool · b4c006c3
      Thadeu Lima de Souza Cascardo authored
      commit eedce141 upstream.
      
      The genalloc code uses the bitmap API from include/linux/bitmap.h and
      lib/bitmap.c, which is based on long values.  Both bitmap_set from
      lib/bitmap.c and bitmap_set_ll, which is the lockless version from
      genalloc.c, use BITMAP_LAST_WORD_MASK to set the first bits in a long in
      the bitmap.
      
      That one uses (1 << bits) - 1, 0b111, if you are setting the first three
      bits.  This means that the API counts from the least significant bits
      (LSB from now on) to the MSB.  The LSB in the first long is bit 0, then.
      The same works for the lookup functions.
      
      The genalloc code uses longs for the bitmap, as it should.  In
      include/linux/genalloc.h, struct gen_pool_chunk has unsigned long
      bits[0] as its last member.  When allocating the struct, genalloc should
      reserve enough space for the bitmap.  This should be a proper number of
      longs that can fit the amount of bits in the bitmap.
      
      However, genalloc allocates an integer number of bytes that fit the
      amount of bits, but may not be an integer amount of longs.  9 bytes, for
      example, could be allocated for 70 bits.
      
      This is a problem in itself if the Least Significat Bit in a long is in
      the byte with the largest address, which happens in Big Endian machines.
      This means genalloc is not allocating the byte in which it will try to
      set or check for a bit.
      
      This may end up in memory corruption, where genalloc will try to set the
      bits it has not allocated.  In fact, genalloc may not set these bits
      because it may find them already set, because they were not zeroed since
      they were not allocated.  And that's what causes a BUG when
      gen_pool_destroy is called and check for any set bits.
      
      What really happens is that genalloc uses kmalloc_node with __GFP_ZERO
      on gen_pool_add_virt.  With SLAB and SLUB, this means the whole slab
      will be cleared, not only the requested bytes.  Since struct
      gen_pool_chunk has a size that is a multiple of 8, and slab sizes are
      multiples of 8, we get lucky and allocate and clear the right amount of
      bytes.
      
      Hower, this is not the case with SLOB or with older code that did memset
      after allocating instead of using __GFP_ZERO.
      
      So, a simple module as this (running 3.6.0), will cause a crash when
      rmmod'ed.
      
        [root@phantom-lp2 foo]# cat foo.c
        #include <linux/kernel.h>
        #include <linux/module.h>
        #include <linux/init.h>
        #include <linux/genalloc.h>
      
        MODULE_LICENSE("GPL");
        MODULE_VERSION("0.1");
      
        static struct gen_pool *foo_pool;
      
        static __init int foo_init(void)
        {
                int ret;
                foo_pool = gen_pool_create(10, -1);
                if (!foo_pool)
                        return -ENOMEM;
                ret = gen_pool_add(foo_pool, 0xa0000000, 32 << 10, -1);
                if (ret) {
                        gen_pool_destroy(foo_pool);
                        return ret;
                }
                return 0;
        }
      
        static __exit void foo_exit(void)
        {
                gen_pool_destroy(foo_pool);
        }
      
        module_init(foo_init);
        module_exit(foo_exit);
        [root@phantom-lp2 foo]# zcat /proc/config.gz | grep SLOB
        CONFIG_SLOB=y
        [root@phantom-lp2 foo]# insmod ./foo.ko
        [root@phantom-lp2 foo]# rmmod foo
        ------------[ cut here ]------------
        kernel BUG at lib/genalloc.c:243!
        cpu 0x4: Vector: 700 (Program Check) at [c0000000bb0e7960]
            pc: c0000000003cb50c: .gen_pool_destroy+0xac/0x110
            lr: c0000000003cb4fc: .gen_pool_destroy+0x9c/0x110
            sp: c0000000bb0e7be0
           msr: 8000000000029032
          current = 0xc0000000bb0e0000
          paca    = 0xc000000006d30e00   softe: 0        irq_happened: 0x01
            pid   = 13044, comm = rmmod
        kernel BUG at lib/genalloc.c:243!
        [c0000000bb0e7ca0] d000000004b00020 .foo_exit+0x20/0x38 [foo]
        [c0000000bb0e7d20] c0000000000dff98 .SyS_delete_module+0x1a8/0x290
        [c0000000bb0e7e30] c0000000000097d4 syscall_exit+0x0/0x94
        --- Exception: c00 (System Call) at 000000800753d1a0
        SP (fffd0b0e640) is in userspace
      Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Benjamin Gaignard <benjamin.gaignard@stericsson.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      b4c006c3
    • Steven Rostedt's avatar
      ring-buffer: Fix race between integrity check and readers · 501563ac
      Steven Rostedt authored
      commit 54f7be5b upstream.
      
      The function rb_check_pages() was added to make sure the ring buffer's
      pages were sane. This check is done when the ring buffer size is modified
      as well as when the iterator is released (closing the "trace" file),
      as that was considered a non fast path and a good place to do a sanity
      check.
      
      The problem is that the check does not have any locks around it.
      If one process were to read the trace file, and another were to read
      the raw binary file, the check could happen while the reader is reading
      the file.
      
      The issues with this is that the check requires to clear the HEAD page
      before doing the full check and it restores it afterward. But readers
      require the HEAD page to exist before it can read the buffer, otherwise
      it gives a nasty warning and disables the buffer.
      
      By adding the reader lock around the check, this keeps the race from
      happening.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      501563ac
    • Shawn Guo's avatar
      kernel/sys.c: call disable_nonboot_cpus() in kernel_restart() · a289a811
      Shawn Guo authored
      commit f96972f2 upstream.
      
      As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
      have kernel_restart() call disable_nonboot_cpus().  Doing so can help
      machines that require boot cpu be the last alive cpu during reboot to
      survive with kernel restart.
      
      This fixes one reboot issue seen on imx6q (Cortex-A9 Quad).  The machine
      requires that the restart routine be run on the primary cpu rather than
      secondary ones.  Otherwise, the secondary core running the restart
      routine will fail to come to online after reboot.
      Signed-off-by: default avatarShawn Guo <shawn.guo@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      a289a811
    • Denys Vlasenko's avatar
      coredump: prevent double-free on an error path in core dumper · 8b7435d2
      Denys Vlasenko authored
      commit f34f9d18 upstream.
      
      In !CORE_DUMP_USE_REGSET case, if elf_note_info_init fails to allocate
      memory for info->fields, it frees already allocated stuff and returns
      error to its caller, fill_note_info.  Which in turn returns error to its
      caller, elf_core_dump.  Which jumps to cleanup label and calls
      free_note_info, which will happily try to free all info->fields again.
      BOOM.
      
      This is the fix.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarDenys Vlasenko <vda.linux@googlemail.com>
      Cc: Venu Byravarasu <vbyravarasu@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      8b7435d2
    • Oleg Nesterov's avatar
      wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task · 16365e5b
      Oleg Nesterov authored
      wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task
      
      CVE-2013-0871
      
      BugLink: http://bugs.launchpad.net/bugs/1129192
      
      wake_up_process() should never wakeup a TASK_STOPPED/TRACED task.
      Change it to use TASK_NORMAL and add the WARN_ON().
      
      TASK_ALL has no other users, probably can be killed.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      (backported from commit 9067ac85)
      
      Conflicts:
      	kernel/sched/core.c
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Acked-by: default avatarColin King <colin.king@canonical.com>
      Signed-off-by: default avatarTim Gardner <tim.gardner@canonical.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      16365e5b
    • Andrew Morton's avatar
      kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORER · 77ae04b3
      Andrew Morton authored
      commit 522cff14 upstream.
      
      __ARCH_HAS_SA_RESTORER is the preferred conditional for use in 3.9 and
      later kernels, per Kees.
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      77ae04b3
    • Ben Hutchings's avatar
      signal: Define __ARCH_HAS_SA_RESTORER so we know whether to clear sa_restorer · e3542c8c
      Ben Hutchings authored
      Vaguely based on upstream commit 574c4866 'consolidate kernel-side
      struct sigaction declarations'.
      
      flush_signal_handlers() needs to know whether sigaction::sa_restorer
      is defined, not whether SA_RESTORER is defined.  Define the
      __ARCH_HAS_SA_RESTORER macro to indicate this.
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      e3542c8c
    • Emese Revfy's avatar
      kernel/signal.c: stop info leak via the tkill and the tgkill syscalls · d7328a9b
      Emese Revfy authored
      commit b9e146d8 upstream.
      
      This fixes a kernel memory contents leak via the tkill and tgkill syscalls
      for compat processes.
      
      This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field
      when handling signals delivered from tkill.
      
      The place of the infoleak:
      
      int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
      {
              ...
              put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr);
              ...
      }
      Signed-off-by: default avatarEmese Revfy <re.emese@gmail.com>
      Reviewed-by: default avatarPaX Team <pageexec@freemail.hu>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      d7328a9b
    • John Johansen's avatar
      ptrace: Fix ptrace when task is in task_is_stopped() state · 96ec5658
      John Johansen authored
      This patch fixes a regression in ptrace, introduced by commit 9e74eb39
      (backport of 9899d11f) which makes assumptions about ptrace behavior
      which are not true in the 2.6.32 kernel.
      
      BugLink: http://bugs.launchpad.net/bugs/1145234
      
      9899d11f makes the assumption that task_is_stopped() is not a valid state
      in ptrace because it is built on top of a series of patches which change
      how the TASK_STOPPED state is tracked (321fb561 which requires d79fdd6d
      and several other patches).
      
      Because Lucid does not have the set of patches that make task_is_stopped()
      an invalid state in ptrace_check_attach, partially revert 9e74eb39 so
      that ptrace_check_attach() correctly handles task_is_stopped(). However
      we must replace the assignment of TASK_TRACED with __TASK_TRACED to
      ensure TASK_WAKEKILL is cleared.
      Signed-off-by: default avatarJohn Johansen <john.johansen@canonical.com>
      Acked-by: default avatarColin King <colin.king@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Signed-off-by: default avatarTim Gardner <tim.gardner@canonical.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      96ec5658
    • Oleg Nesterov's avatar
      ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL · 062bdcb3
      Oleg Nesterov authored
      ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
      
      CVE-2013-0871
      
      BugLink: http://bugs.launchpad.net/bugs/1129192
      
      putreg() assumes that the tracee is not running and pt_regs_access() can
      safely play with its stack.  However a killed tracee can return from
      ptrace_stop() to the low-level asm code and do RESTORE_REST, this means
      that debugger can actually read/modify the kernel stack until the tracee
      does SAVE_REST again.
      
      set_task_blockstep() can race with SIGKILL too and in some sense this
      race is even worse, the very fact the tracee can be woken up breaks the
      logic.
      
      As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
      call, this ensures that nobody can ever wakeup the tracee while the
      debugger looks at it.  Not only this fixes the mentioned problems, we
      can do some cleanups/simplifications in arch_ptrace() paths.
      
      Probably ptrace_unfreeze_traced() needs more callers, for example it
      makes sense to make the tracee killable for oom-killer before
      access_process_vm().
      
      While at it, add the comment into may_ptrace_stop() to explain why
      ptrace_stop() still can't rely on SIGKILL and signal_pending_state().
      Reported-by: default avatarSalman Qazi <sqazi@google.com>
      Reported-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      (backported from commit 9899d11f)
      
      Conflicts:
      	arch/x86/kernel/step.c
      	kernel/ptrace.c
      	kernel/signal.c
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Acked-by: default avatarColin King <colin.king@canonical.com>
      Signed-off-by: default avatarTim Gardner <tim.gardner@canonical.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      062bdcb3
    • Oleg Nesterov's avatar
      ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() · 905f1272
      Oleg Nesterov authored
      ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up()
      
      CVE-2013-0871
      
      BugLink: http://bugs.launchpad.net/bugs/1129192
      
      Cleanup and preparation for the next change.
      
      signal_wake_up(resume => true) is overused. None of ptrace/jctl callers
      actually want to wakeup a TASK_WAKEKILL task, but they can't specify the
      necessary mask.
      
      Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
      signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
      which adds __TASK_TRACED.
      
      This way ptrace_signal_wake_up() can work "inside" ptrace_request()
      even if the tracee doesn't have the TASK_WAKEKILL bit set.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      (backported from commit 910ffdb1)
      
      Conflicts:
      	kernel/ptrace.c
      	kernel/signal.c
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Acked-by: default avatarColin King <colin.king@canonical.com>
      Signed-off-by: default avatarTim Gardner <tim.gardner@canonical.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      905f1272
    • Oleg Nesterov's avatar
      ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread · fd2ab7dc
      Oleg Nesterov authored
      ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread
      
      CVE-2013-0871
      
      BugLink: http://bugs.launchpad.net/bugs/1129192
      
      It is not clear why ptrace_resume() does wake_up_process(). Unless the
      caller is PTRACE_KILL the tracee should be TASK_TRACED so we can use
      wake_up_state(__TASK_TRACED). If sys_ptrace() races with SIGKILL we do
      not need the extra and potentionally spurious wakeup.
      
      If the caller is PTRACE_KILL, wake_up_process() is even more wrong.
      The tracee can sleep in any state in any place, and if we have a buggy
      code which doesn't handle a spurious wakeup correctly PTRACE_KILL can
      be used to exploit it. For example:
      
      	int main(void)
      	{
      		int child, status;
      
      		child = fork();
      		if (!child) {
      			int ret;
      
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      
      			ret = pause();
      			printf("pause: %d %m\n", ret);
      
      			return 0x23;
      		}
      
      		sleep(1);
      		assert(ptrace(PTRACE_KILL, child, 0,0) == 0);
      
      		assert(child == wait(&status));
      		printf("wait: %x\n", status);
      
      		return 0;
      	}
      
      prints "pause: -1 Unknown error 514", -ERESTARTNOHAND leaks to the
      userland. In this case sys_pause() is buggy as well and should be
      fixed.
      
      I do not know what was the original rationality behind PTRACE_KILL.
      The man page is simply wrong and afaics it was always wrong. Imho
      it should be deprecated, or may be it should do send_sig(SIGKILL)
      as Denys suggests, but in any case I do not think that the current
      behaviour was intentional.
      
      Note: there is another problem, ptrace_resume() changes ->exit_code
      and this can race with SIGKILL too. Eventually we should change ptrace
      to not use ->exit_code.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      (cherry picked from commit 0666fb51)
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Acked-by: default avatarColin King <colin.king@canonical.com>
      Signed-off-by: default avatarTim Gardner <tim.gardner@canonical.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      fd2ab7dc
    • Kees Cook's avatar
      signal: always clear sa_restorer on execve · 0433cb59
      Kees Cook authored
      commit 2ca39528 upstream.
      
      When the new signal handlers are set up, the location of sa_restorer is
      not cleared, leaking a parent process's address space location to
      children.  This allows for a potential bypass of the parent's ASLR by
      examining the sa_restorer value returned when calling sigaction().
      
      Based on what should be considered "secret" about addresses, it only
      matters across the exec not the fork (since the VMAs haven't changed
      until the exec).  But since exec sets SIG_DFL and keeps sa_restorer,
      this is where it should be fixed.
      
      Given the few uses of sa_restorer, a "set" function was not written
      since this would be the only use.  Instead, we use
      __ARCH_HAS_SA_RESTORER, as already done in other places.
      
      Example of the leak before applying this patch:
      
        $ cat /proc/$$/maps
        ...
        7fb9f3083000-7fb9f3238000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
        ...
        $ ./leak
        ...
        7f278bc74000-7f278be29000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
        ...
        1 0 (nil) 0x7fb9f30b94a0
        2 4000000 (nil) 0x7f278bcaa4a0
        3 4000000 (nil) 0x7f278bcaa4a0
        4 0 (nil) 0x7fb9f30b94a0
        ...
      
      [akpm@linux-foundation.org: use SA_RESTORER for backportability]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reported-by: default avatarEmese Revfy <re.emese@gmail.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      0433cb59
    • Kees Cook's avatar
      exec: use -ELOOP for max recursion depth · fa078af1
      Kees Cook authored
      commit d7402698 upstream
      
      To avoid an explosion of request_module calls on a chain of abusive
      scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
      as maximum recursion depth is hit, the error will fail all the way back
      up the chain, aborting immediately.
      
      This also has the side-effect of stopping the user's shell from attempting
      to reexecute the top-level file as a shell script. As seen in the
      dash source:
      
              if (cmd != path_bshell && errno == ENOEXEC) {
                      *argv-- = cmd;
                      *argv = cmd = path_bshell;
                      goto repeat;
              }
      
      The above logic was designed for running scripts automatically that lacked
      the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
      things continue to behave as the shell expects.
      
      Additionally, when tracking recursion, the binfmt handlers should not be
      involved. The recursion being tracked is the depth of calls through
      search_binary_handler(), so that function should be exclusively responsible
      for tracking the depth.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: backported to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      fa078af1
    • Kees Cook's avatar
      exec: do not leave bprm->interp on stack · 59e376d0
      Kees Cook authored
      commit b66c5984 upstream
      
      If a series of scripts are executed, each triggering module loading via
      unprintable bytes in the script header, kernel stack contents can leak
      into the command line.
      
      Normally execution of binfmt_script and binfmt_misc happens recursively.
      However, when modules are enabled, and unprintable bytes exist in the
      bprm->buf, execution will restart after attempting to load matching
      binfmt modules.  Unfortunately, the logic in binfmt_script and
      binfmt_misc does not expect to get restarted.  They leave bprm->interp
      pointing to their local stack.  This means on restart bprm->interp is
      left pointing into unused stack memory which can then be copied into the
      userspace argv areas.
      
      After additional study, it seems that both recursion and restart remains
      the desirable way to handle exec with scripts, misc, and modules.  As
      such, we need to protect the changes to interp.
      
      This changes the logic to require allocation for any changes to the
      bprm->interp.  To avoid adding a new kmalloc to every exec, the default
      value is left as-is.  Only when passing through binfmt_script or
      binfmt_misc does an allocation take place.
      
      For a proof of concept, see DoTest.sh from:
      
         http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: backported to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      59e376d0
    • Oleg Nesterov's avatar
      kmod: make __request_module() killable · ec4808e7
      Oleg Nesterov authored
      commit 1cc684ab upstream
      
      As Tetsuo Handa pointed out, request_module() can stress the system
      while the oom-killed caller sleeps in TASK_UNINTERRUPTIBLE.
      
      The task T uses "almost all" memory, then it does something which
      triggers request_module().  Say, it can simply call sys_socket().  This
      in turn needs more memory and leads to OOM.  oom-killer correctly
      chooses T and kills it, but this can't help because it sleeps in
      TASK_UNINTERRUPTIBLE and after that oom-killer becomes "disabled" by the
      TIF_MEMDIE task T.
      
      Make __request_module() killable.  The only necessary change is that
      call_modprobe() should kmalloc argv and module_name, they can't live in
      the stack if we use UMH_KILLABLE.  This memory is freed via
      call_usermodehelper_freeinfo()->cleanup.
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf, bwh: backported to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      ec4808e7
    • Oleg Nesterov's avatar
      kmod: introduce call_modprobe() helper · d5734d32
      Oleg Nesterov authored
      commit 3e63a93b upstream
      
      No functional changes.  Move the call_usermodehelper code from
      __request_module() into the new simple helper, call_modprobe().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: backported to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      d5734d32
    • Oleg Nesterov's avatar
      usermodehelper: ____call_usermodehelper() doesn't need do_exit() · 67fd05cc
      Oleg Nesterov authored
      commit 5b9bd473 upstream
      
      Minor cleanup.  ____call_usermodehelper() can simply return, no need to
      call do_exit() explicitely.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: adjusted to apply to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      67fd05cc
    • Oleg Nesterov's avatar
      usermodehelper: implement UMH_KILLABLE · e2a28e9a
      Oleg Nesterov authored
      commit d0bd587a upstream
      
      Implement UMH_KILLABLE, should be used along with UMH_WAIT_EXEC/PROC.
      The caller must ensure that subprocess_info->path/etc can not go away
      until call_usermodehelper_freeinfo().
      
      call_usermodehelper_exec(UMH_KILLABLE) does
      wait_for_completion_killable.  If it fails, it uses
      xchg(&sub_info->complete, NULL) to serialize with umh_complete() which
      does the same xhcg() to access sub_info->complete.
      
      If call_usermodehelper_exec wins, it can safely return.  umh_complete()
      should get NULL and call call_usermodehelper_freeinfo().
      
      Otherwise we know that umh_complete() was already called, in this case
      call_usermodehelper_exec() falls back to wait_for_completion() which
      should succeed "very soon".
      
      Note: UMH_NO_WAIT == -1 but it obviously should not be used with
      UMH_KILLABLE.  We delay the neccessary cleanup to simplify the back
      porting.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: backported to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      e2a28e9a
    • Oleg Nesterov's avatar
      usermodehelper: introduce umh_complete(sub_info) · 13db7353
      Oleg Nesterov authored
      commit b3449922 upstream
      
      Preparation.  Add the new trivial helper, umh_complete().  Currently it
      simply does complete(sub_info->complete).
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [dannf: Adjusted to apply to Debian's 2.6.32]
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      13db7353
    • Kees Cook's avatar
      gen_init_cpio: avoid stack overflow when expanding · dbd3462b
      Kees Cook authored
      commit 20f1de65 upstream.
      
      Fix possible overflow of the buffer used for expanding environment
      variables when building file list.
      
      In the extremely unlikely case of an attacker having control over the
      environment variables visible to gen_init_cpio, control over the
      contents of the file gen_init_cpio parses, and gen_init_cpio was built
      without compiler hardening, the attacker can gain arbitrary execution
      control via a stack buffer overflow.
      
        $ cat usr/crash.list
        file foo ${BIG}${BIG}${BIG}${BIG}${BIG}${BIG} 0755 0 0
        $ BIG=$(perl -e 'print "A" x 4096;') ./usr/gen_init_cpio usr/crash.list
        *** buffer overflow detected ***: ./usr/gen_init_cpio terminated
      
      This also replaces the space-indenting with tabs.
      
      Patch based on existing fix extracted from grsecurity.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: PaX Team <pageexec@freemail.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      dbd3462b
    • Jean Delvare's avatar
      kbuild: Fix gcc -x syntax · 56fb3d90
      Jean Delvare authored
      This is upstream commit b1e0d8b7
      backported to the 2.6.32.x stable branch.
      
      The correct syntax for gcc -x is "gcc -x assembler", not
      "gcc -xassembler". Even though the latter happens to work, the former
      is what is documented in the manual page and thus what gcc wrappers
      such as icecream do expect.
      
      This isn't a cosmetic change. The missing space prevents icecream from
      recognizing compilation tasks it can't handle, leading to silent kernel
      miscompilations.
      
      Besides me, credits go to Michael Matz and Dirk Mueller for
      investigating the miscompilation issue and tracking it down to this
      incorrect -x parameter syntax.
      Signed-off-by: default avatarJean Delvare <jdelvare@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Bernhard Walle <bernhard@bwalle.de>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      56fb3d90
    • Thomas Gleixner's avatar
      tick: Cleanup NOHZ per cpu data on cpu down · d31e3b9e
      Thomas Gleixner authored
      commit 4b0c0f29 upstream.
      
      Prarit reported a crash on CPU offline/online. The reason is that on
      CPU down the NOHZ related per cpu data of the dead cpu is not cleaned
      up. If at cpu online an interrupt happens before the per cpu tick
      device is registered the irq_enter() check potentially sees stale data
      and dereferences a NULL pointer.
      
      Cleanup the data after the cpu is dead.
      Reported-by: default avatarPrarit Bhargava <prarit@redhat.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionosSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      d31e3b9e
    • Tirupathi Reddy's avatar
      timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE · 34d91822
      Tirupathi Reddy authored
      commit 42a5cf46 upstream.
      
      An inactive timer's base can refer to a offline cpu's base.
      
      In the current code, cpu_base's lock is blindly reinitialized each
      time a CPU is brought up. If a CPU is brought online during the period
      that another thread is trying to modify an inactive timer on that CPU
      with holding its timer base lock, then the lock will be reinitialized
      under its feet. This leads to following SPIN_BUG().
      
      <0> BUG: spinlock already unlocked on CPU#3, kworker/u:3/1466
      <0> lock: 0xe3ebe000, .magic: dead4ead, .owner: kworker/u:3/1466, .owner_cpu: 1
      <4> [<c0013dc4>] (unwind_backtrace+0x0/0x11c) from [<c026e794>] (do_raw_spin_unlock+0x40/0xcc)
      <4> [<c026e794>] (do_raw_spin_unlock+0x40/0xcc) from [<c076c160>] (_raw_spin_unlock+0x8/0x30)
      <4> [<c076c160>] (_raw_spin_unlock+0x8/0x30) from [<c009b858>] (mod_timer+0x294/0x310)
      <4> [<c009b858>] (mod_timer+0x294/0x310) from [<c00a5e04>] (queue_delayed_work_on+0x104/0x120)
      <4> [<c00a5e04>] (queue_delayed_work_on+0x104/0x120) from [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c)
      <4> [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c) from [<c04d8780>] (sdhci_disable+0x40/0x48)
      <4> [<c04d8780>] (sdhci_disable+0x40/0x48) from [<c04bf300>] (mmc_release_host+0x4c/0xb0)
      <4> [<c04bf300>] (mmc_release_host+0x4c/0xb0) from [<c04c7aac>] (mmc_sd_detect+0x90/0xfc)
      <4> [<c04c7aac>] (mmc_sd_detect+0x90/0xfc) from [<c04c2504>] (mmc_rescan+0x7c/0x2c4)
      <4> [<c04c2504>] (mmc_rescan+0x7c/0x2c4) from [<c00a6a7c>] (process_one_work+0x27c/0x484)
      <4> [<c00a6a7c>] (process_one_work+0x27c/0x484) from [<c00a6e94>] (worker_thread+0x210/0x3b0)
      <4> [<c00a6e94>] (worker_thread+0x210/0x3b0) from [<c00aad9c>] (kthread+0x80/0x8c)
      <4> [<c00aad9c>] (kthread+0x80/0x8c) from [<c000ea80>] (kernel_thread_exit+0x0/0x8)
      
      As an example, this particular crash occurred when CPU #3 is executing
      mod_timer() on an inactive timer whose base is refered to offlined CPU
      #2.  The code locked the timer_base corresponding to CPU #2. Before it
      could proceed, CPU #2 came online and reinitialized the spinlock
      corresponding to its base. Thus now CPU #3 held a lock which was
      reinitialized. When CPU #3 finally ended up unlocking the old cpu_base
      corresponding to CPU #2, we hit the above SPIN_BUG().
      
      CPU #0		CPU #3				       CPU #2
      ------		-------				       -------
      .....		 ......				      <Offline>
      		mod_timer()
      		 lock_timer_base
      		   spin_lock_irqsave(&base->lock)
      
      cpu_up(2)	 .....				        ......
      							init_timers_cpu()
      ....		 .....				    	spin_lock_init(&base->lock)
      .....		   spin_unlock_irqrestore(&base->lock)  ......
      		   <spin_bug>
      
      Allocation of per_cpu timer vector bases is done only once under
      "tvec_base_done[]" check. In the current code, spinlock_initialization
      of base->lock isn't under this check. When a CPU is up each time the
      base lock is reinitialized. Move base spinlock initialization under
      the check.
      Signed-off-by: default avatarTirupathi Reddy <tirupath@codeaurora.org>
      Link: http://lkml.kernel.org/r/1368520142-4136-1-git-send-email-tirupath@codeaurora.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      34d91822
    • Stanislaw Gruszka's avatar
      posix-cpu-timers: Fix nanosleep task_struct leak · 820e8b30
      Stanislaw Gruszka authored
      commit e6c42c29 upstream.
      
      The trinity fuzzer triggered a task_struct reference leak via
      clock_nanosleep with CPU_TIMERs. do_cpu_nanosleep() calls
      posic_cpu_timer_create(), but misses a corresponding
      posix_cpu_timer_del() which leads to the task_struct reference leak.
      Reported-and-tested-by: default avatarTommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: default avatarStanislaw Gruszka <sgruszka@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Link: http://lkml.kernel.org/r/20130215100810.GF4392@redhat.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      820e8b30
    • Mark Rutland's avatar
      clockevents: Don't allow dummy broadcast timers · 30145daa
      Mark Rutland authored
      commit a7dc19b8 upstream.
      
      Currently tick_check_broadcast_device doesn't reject clock_event_devices
      with CLOCK_EVT_FEAT_DUMMY, and may select them in preference to real
      hardware if they have a higher rating value. In this situation, the
      dummy timer is responsible for broadcasting to itself, and the core
      clockevents code may attempt to call non-existent callbacks for
      programming the dummy, eventually leading to a panic.
      
      This patch makes tick_check_broadcast_device always reject dummy timers,
      preventing this problem.
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Jon Medhurst (Tixy) <tixy@linaro.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      30145daa
    • John Stultz's avatar
      2.6.32.y: timekeeping: Fix nohz issue with commit 61b76840 · d556d326
      John Stultz authored
      Commit 61b76840 ("time: Avoid
      making adjustments if we haven't accumulated anything")
      introduced a regression with nohz.
      
      Basically with kernels between 2.6.20-something to 2.6.32,
      we accumulate time in half second chunks, rather then every
      timer-tick. This was added because when NOHZ landed, if you
      were idle for a few seconds, you had to spin for every tick
      we skipped in the accumulation loop, which created some bad
      latencies.
      
      However, this required that we create the xtime_cache() which
      was still updated each tick, so that filesystem timestamps,
      etc continued to see time increment normally.
      
      Of course, the xtime_cache is updated at the bottom of
      update_wall_time(). So the early return on
      (offset < timekeeper.cycle_interval), added by the problematic
      commit causes the xtime_cache to not be updated.
      
      This can cause code using current_kernel_time() (like the mqueue
      code) or hrtimer_get_softirq_time(), which uses the non-updated
      xtime_cache, to see timers to fire with very coarse half-second
      granularity.
      
      Many thanks to Romain for describing the issue clearly,
      providing test case to reproduce it and helping with testing
      the solution.
      
      This change is for 2.6.32-stable ONLY!
      
      Cc: stable@vger.kernel.org
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Romain Francoise <romain@orebokech.com>
      Reported-by: default avatarRomain Francoise <romain@orebokech.com>
      Tested-by: default avatarRomain Francoise <romain@orebokech.com>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      d556d326
    • Jens Axboe's avatar
      Revert "block: improve queue_should_plug() by looking at IO depths" · ec2826bc
      Jens Axboe authored
      This reverts commit fb1e7538.
      
      "Benjamin S." <sbenni@gmx.de> reports that the patch in question
      causes a big drop in sequential throughput for him, dropping from
      200MB/sec down to only 70MB/sec.
      
      Needs to be investigated more fully, for now lets just revert the
      offending commit.
      
      Conflicts:
      
      	include/linux/blkdev.h
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      (cherry picked from commit 79da0644)
      Cc: Thomas Bork <tom@eisfair.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      ec2826bc