1. 13 Feb, 2014 12 commits
    • Johannes Weiner's avatar
      mm/page-writeback.c: do not count anon pages as dirtyable memory · 48526149
      Johannes Weiner authored
      commit a1c3bfb2 upstream.
      
      The VM is currently heavily tuned to avoid swapping.  Whether that is
      good or bad is a separate discussion, but as long as the VM won't swap
      to make room for dirty cache, we can not consider anonymous pages when
      calculating the amount of dirtyable memory, the baseline to which
      dirty_background_ratio and dirty_ratio are applied.
      
      A simple workload that occupies a significant size (40+%, depending on
      memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
      uses the remainder for a streaming writer demonstrates this problem.  In
      that case, the actual cache pages are a small fraction of what is
      considered dirtyable overall, which results in an relatively large
      portion of the cache pages to be dirtied.  As kswapd starts rotating
      these, random tasks enter direct reclaim and stall on IO.
      
      Only consider free pages and file pages dirtyable.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarTejun Heo <tj@kernel.org>
      Tested-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      48526149
    • Johannes Weiner's avatar
      mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memory · 03381bd2
      Johannes Weiner authored
      commit a804552b upstream.
      
      Tejun reported stuttering and latency spikes on a system where random
      tasks would enter direct reclaim and get stuck on dirty pages.  Around
      50% of memory was occupied by tmpfs backed by an SSD, and another disk
      (rotating) was reading and writing at max speed to shrink a partition.
      
      : The problem was pretty ridiculous.  It's a 8gig machine w/ one ssd and 10k
      : rpm harddrive and I could reliably reproduce constant stuttering every
      : several seconds for as long as buffered IO was going on on the hard drive
      : either with tmpfs occupying somewhere above 4gig or a test program which
      : allocates about the same amount of anon memory.  Although swap usage was
      : zero, turning off swap also made the problem go away too.
      :
      : The trigger conditions seem quite plausible - high anon memory usage w/
      : heavy buffered IO and swap configured - and it's highly likely that this
      : is happening in the wild too.  (this can happen with copying large files
      : to usb sticks too, right?)
      
      This patch (of 2):
      
      The dirty_balance_reserve is an approximation of the fraction of free
      pages that the page allocator does not make available for page cache
      allocations.  As a result, it has to be taken into account when
      calculating the amount of "dirtyable memory", the baseline to which
      dirty_background_ratio and dirty_ratio are applied.
      
      However, currently the reserve is subtracted from the sum of free and
      reclaimable pages, which is non-sensical and leads to erroneous results
      when the system is dominated by unreclaimable pages and the
      dirty_balance_reserve is bigger than free+reclaimable.  In that case, at
      least the already allocated cache should be considered dirtyable.
      
      Fix the calculation by subtracting the reserve from the amount of free
      pages, then adding the reclaimable pages on top.
      
      [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarTejun Heo <tj@kernel.org>
      Tested-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      03381bd2
    • Naoya Horiguchi's avatar
      mm/memory-failure.c: shift page lock from head page to tail page after thp split · 9fa1577a
      Naoya Horiguchi authored
      commit 54b9dd14 upstream.
      
      After thp split in hwpoison_user_mappings(), we hold page lock on the
      raw error page only between try_to_unmap, hence we are in danger of race
      condition.
      
      I found in the RHEL7 MCE-relay testing that we have "bad page" error
      when a memory error happens on a thp tail page used by qemu-kvm:
      
        Triggering MCE exception on CPU 10
        mce: [Hardware Error]: Machine check events logged
        MCE exception done on CPU 10
        MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
        MCE 0x38c535: dirty LRU page recovery: Recovered
        qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
        BUG: Bad page state in process qemu-kvm  pfn:38c400
        page:ffffea000e310000 count:0 mapcount:0 mapping:          (null) index:0x7ffae3c00
        page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
        Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
        CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G   M        --------------   3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
        Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
        Call Trace:
          dump_stack+0x19/0x1b
          bad_page.part.59+0xcf/0xe8
          free_pages_prepare+0x148/0x160
          free_hot_cold_page+0x31/0x140
          free_hot_cold_page_list+0x46/0xa0
          release_pages+0x1c1/0x200
          free_pages_and_swap_cache+0xad/0xd0
          tlb_flush_mmu.part.46+0x4c/0x90
          tlb_finish_mmu+0x55/0x60
          exit_mmap+0xcb/0x170
          mmput+0x67/0xf0
          vhost_dev_cleanup+0x231/0x260 [vhost_net]
          vhost_net_release+0x3f/0x90 [vhost_net]
          __fput+0xe9/0x270
          ____fput+0xe/0x10
          task_work_run+0xc4/0xe0
          do_exit+0x2bb/0xa40
          do_group_exit+0x3f/0xa0
          get_signal_to_deliver+0x1d0/0x6e0
          do_signal+0x48/0x5e0
          do_notify_resume+0x71/0xc0
          retint_signal+0x48/0x8c
      
      The reason of this bug is that a page fault happens before unlocking the
      head page at the end of memory_failure().  This strange page fault is
      trying to access to address 0x20 and I'm not sure why qemu-kvm does
      this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the
      way we catch the bad page bug/warning because we try to free a locked
      page (which was the former head page.)
      
      To fix this, this patch suggests to shift page lock from head page to
      tail page just after thp split.  SIGSEGV still happens, but it affects
      only error affected VMs, not a whole system.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9fa1577a
    • AKASHI Takahiro's avatar
      audit: correct a type mismatch in audit_syscall_exit() · 186b643a
      AKASHI Takahiro authored
      commit 06bdadd7 upstream.
      
      audit_syscall_exit() saves a result of regs_return_value() in intermediate
      "int" variable and passes it to __audit_syscall_exit(), which expects its
      second argument as a "long" value.  This will result in truncating the
      value returned by a system call and making a wrong audit record.
      
      I don't know why gcc compiler doesn't complain about this, but anyway it
      causes a problem at runtime on arm64 (and probably most 64-bit archs).
      Signed-off-by: default avatarAKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      186b643a
    • Richard Guy Briggs's avatar
      audit: reset audit backlog wait time after error recovery · 34210bee
      Richard Guy Briggs authored
      commit e789e561 upstream.
      
      When the audit queue overflows and times out (audit_backlog_wait_time), the
      audit queue overflow timeout is set to zero.  Once the audit queue overflow
      timeout condition recovers, the timeout should be reset to the original value.
      
      See also:
      	https://lkml.org/lkml/2013/9/2/473Signed-off-by: default avatarLuiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: default avatarDan Duval <dan.duval@oracle.com>
      Signed-off-by: default avatarChuck Anderson <chuck.anderson@oracle.com>
      Signed-off-by: default avatarRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      34210bee
    • Miklos Szeredi's avatar
      fuse: fix pipe_buf_operations · d840f989
      Miklos Szeredi authored
      commit 28a625cb upstream.
      
      Having this struct in module memory could Oops when if the module is
      unloaded while the buffer still persists in a pipe.
      
      Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
      merge them into nosteal_pipe_buf_ops (this is the same as
      default_pipe_buf_ops except stealing the page from the buffer is not
      allowed).
      Reported-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d840f989
    • Bjorn Helgaas's avatar
      Revert "EISA: Initialize device before its resources" · 329f4557
      Bjorn Helgaas authored
      commit 765ee51f upstream.
      
      This reverts commit 26abfeed.
      
      In the eisa_probe() force_probe path, if we were unable to request slot
      resources (e.g., [io 0x800-0x8ff]), we skipped the slot with "Cannot
      allocate resource for EISA slot %d" before reading the EISA signature in
      eisa_init_device().
      
      Commit 26abfeed moved eisa_init_device() earlier, so we tried to read
      the EISA signature before requesting the slot resources, and this caused
      hangs during boot.
      
      Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1251816Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      329f4557
    • Alex Williamson's avatar
      intel-iommu: fix off-by-one in pagetable freeing · 32df365d
      Alex Williamson authored
      commit 08336fd2 upstream.
      
      dma_pte_free_level() has an off-by-one error when checking whether a pte
      is completely covered by a range.  Take for example the case of
      attempting to free pfn 0x0 - 0x1ff, ie.  512 entries covering the first
      2M superpage.
      
      The level_size() is 0x200 and we test:
      
        static void dma_pte_free_level(...
      	...
      
      	if (!(0 > 0 || 0x1ff < 0 + 0x200)) {
      		...
      	}
      
      Clearly the 2nd test is true, which means we fail to take the branch to
      clear and free the pagetable entry.  As a result, we're leaking
      pagetables and failing to install new pages over the range.
      
      This was found with a PCI device assigned to a QEMU guest using vfio-pci
      without a VGA device present.  The first 1M of guest address space is
      mapped with various combinations of 4K pages, but eventually the range
      is entirely freed and replaced with a 2M contiguous mapping.
      intel-iommu errors out with something like:
      
        ERROR: DMA PTE for vPFN 0x0 already set (to 5c2b8003 not 849c00083)
      
      In this case 5c2b8003 is the pointer to the previous leaf page that was
      neither freed nor cleared and 849c00083 is the superpage entry that
      we're trying to replace it with.
      Signed-off-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      32df365d
    • Wanlong Gao's avatar
      arch/sh/kernel/kgdb.c: add missing #include <linux/sched.h> · 25f43284
      Wanlong Gao authored
      commit 53a52f17 upstream.
      
        arch/sh/kernel/kgdb.c: In function 'sleeping_thread_to_gdb_regs':
        arch/sh/kernel/kgdb.c:225:32: error: implicit declaration of function 'task_stack_page' [-Werror=implicit-function-declaration]
        arch/sh/kernel/kgdb.c:242:23: error: dereferencing pointer to incomplete type
        arch/sh/kernel/kgdb.c:243:22: error: dereferencing pointer to incomplete type
        arch/sh/kernel/kgdb.c: In function 'singlestep_trap_handler':
        arch/sh/kernel/kgdb.c:310:27: error: 'SIGTRAP' undeclared (first use in this function)
        arch/sh/kernel/kgdb.c:310:27: note: each undeclared identifier is reported only once for each function it appears in
      
      This was introduced by commit 16559ae4 ("kgdb: remove #include
      <linux/serial_8250.h> from kgdb.h").
      
      [geert@linux-m68k.org: reworded and reformatted]
      Signed-off-by: default avatarWanlong Gao <gaowanlong@cn.fujitsu.com>
      Signed-off-by: default avatarGeert Uytterhoeven <geert+renesas@linux-m68k.org>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25f43284
    • Steven Rostedt (Red Hat)'s avatar
      tracing: Check if tracing is enabled in trace_puts() · f74bb740
      Steven Rostedt (Red Hat) authored
      commit 3132e107 upstream.
      
      If trace_puts() is used very early in boot up, it can crash the machine
      if it is called before the ring buffer is allocated. If a trace_printk()
      is used with no arguments, then it will be converted into a trace_puts()
      and suffer the same fate.
      
      Fixes: 09ae7234 "tracing: Add trace_puts() for even faster trace_printk() tracing"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f74bb740
    • Steven Rostedt (Red Hat)'s avatar
      tracing: Have trace buffer point back to trace_array · fb23eaf4
      Steven Rostedt (Red Hat) authored
      commit dced341b upstream.
      
      The trace buffer has a descriptor pointer that goes back to the trace
      array. But it was never assigned. Luckily, nothing uses it (yet), but
      it will in the future.
      
      Although nothing currently uses this, if any of the new features get
      backported to older kernels, and because this is such a simple change,
      I'm marking it for stable too.
      
      Fixes: 12883efb "tracing: Consolidate max_tr into main trace_array structure"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb23eaf4
    • Tetsuo Handa's avatar
      SELinux: Fix memory leak upon loading policy · f6333f55
      Tetsuo Handa authored
      commit 8ed81460 upstream.
      
      Hello.
      
      I got below leak with linux-3.10.0-54.0.1.el7.x86_64 .
      
      [  681.903890] kmemleak: 5538 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      
      Below is a patch, but I don't know whether we need special handing for undoing
      ebitmap_set_bit() call.
      ----------
      >>From fe97527a90fe95e2239dfbaa7558f0ed559c0992 Mon Sep 17 00:00:00 2001
      From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Date: Mon, 6 Jan 2014 16:30:21 +0900
      Subject: SELinux: Fix memory leak upon loading policy
      
      Commit 2463c26d "SELinux: put name based create rules in a hashtable" did not
      check return value from hashtab_insert() in filename_trans_read(). It leaks
      memory if hashtab_insert() returns error.
      
        unreferenced object 0xffff88005c9160d0 (size 8):
          comm "systemd", pid 1, jiffies 4294688674 (age 235.265s)
          hex dump (first 8 bytes):
            57 0b 00 00 6b 6b 6b a5                          W...kkk.
          backtrace:
            [<ffffffff816604ae>] kmemleak_alloc+0x4e/0xb0
            [<ffffffff811cba5e>] kmem_cache_alloc_trace+0x12e/0x360
            [<ffffffff812aec5d>] policydb_read+0xd1d/0xf70
            [<ffffffff812b345c>] security_load_policy+0x6c/0x500
            [<ffffffff812a623c>] sel_write_load+0xac/0x750
            [<ffffffff811eb680>] vfs_write+0xc0/0x1f0
            [<ffffffff811ec08c>] SyS_write+0x4c/0xa0
            [<ffffffff81690419>] system_call_fastpath+0x16/0x1b
            [<ffffffffffffffff>] 0xffffffffffffffff
      
      However, we should not return EEXIST error to the caller, or the systemd will
      show below message and the boot sequence freezes.
      
        systemd[1]: Failed to load SELinux policy. Freezing.
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarPaul Moore <pmoore@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6333f55
  2. 06 Feb, 2014 28 commits