1. 15 Sep, 2017 40 commits
    • Linus Torvalds's avatar
      Sanitize 'move_pages()' permission checks · b5a16892
      Linus Torvalds authored
      commit 197e7e52 upstream.
      
      The 'move_paghes()' system call was introduced long long ago with the
      same permission checks as for sending a signal (except using
      CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
      
      That turns out to not be a great choice - while the system call really
      only moves physical page allocations around (and you need other
      capabilities to do a lot of it), you can check the return value to map
      out some the virtual address choices and defeat ASLR of a binary that
      still shares your uid.
      
      So change the access checks to the more common 'ptrace_may_access()'
      model instead.
      
      This tightens the access checks for the uid, and also effectively
      changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
      anybody really _uses_ this legacy system call any more (we hav ebetter
      NUMA placement models these days), so I expect nobody to notice.
      
      Famous last words.
      Reported-by: default avatarOtto Ebeling <otto.ebeling@iki.fi>
      Acked-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b5a16892
    • Wei Wang's avatar
      tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0 · 32cb2d4a
      Wei Wang authored
      commit 499350a5 upstream.
      
      When tcp_disconnect() is called, inet_csk_delack_init() sets
      icsk->icsk_ack.rcv_mss to 0.
      This could potentially cause tcp_recvmsg() => tcp_cleanup_rbuf() =>
      __tcp_select_window() call path to have division by 0 issue.
      So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0.
      Reported-by: default avatarAndrey Konovalov  <andreyknvl@google.com>
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      32cb2d4a
    • Roger Pau Monne's avatar
      xen: fix bio vec merging · f664b011
      Roger Pau Monne authored
      commit 462cdace upstream.
      
      The current test for bio vec merging is not fully accurate and can be
      tricked into merging bios when certain grant combinations are used.
      The result of these malicious bio merges is a bio that extends past
      the memory page used by any of the originating bios.
      
      Take into account the following scenario, where a guest creates two
      grant references that point to the same mfn, ie: grant 1 -> mfn A,
      grant 2 -> mfn A.
      
      These references are then used in a PV block request, and mapped by
      the backend domain, thus obtaining two different pfns that point to
      the same mfn, pfn B -> mfn A, pfn C -> mfn A.
      
      If those grants happen to be used in two consecutive sectors of a disk
      IO operation becoming two different bios in the backend domain, the
      checks in xen_biovec_phys_mergeable will succeed, because bfn1 == bfn2
      (they both point to the same mfn). However due to the bio merging,
      the backend domain will end up with a bio that expands past mfn A into
      mfn A + 1.
      
      Fix this by making sure the check in xen_biovec_phys_mergeable takes
      into account the offset and the length of the bio, this basically
      replicates whats done in __BIOVEC_PHYS_MERGEABLE using mfns (bus
      addresses). While there also remove the usage of
      __BIOVEC_PHYS_MERGEABLE, since that's already checked by the callers
      of xen_biovec_phys_mergeable.
      Reported-by: default avatar"Jan H. Schönherr" <jschoenh@amazon.de>
      Signed-off-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [bwh: Backported to 3.16:
       - s/bfn/mfn/g
       - Adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f664b011
    • Vladis Dronov's avatar
      xfrm: policy: check policy direction value · 60166dc9
      Vladis Dronov authored
      commit 7bab0963 upstream.
      
      The 'dir' parameter in xfrm_migrate() is a user-controlled byte which is used
      as an array index. This can lead to an out-of-bound access, kernel lockup and
      DoS. Add a check for the 'dir' value.
      
      This fixes CVE-2017-11600.
      
      References: https://bugzilla.redhat.com/show_bug.cgi?id=1474928
      Fixes: 80c9abaa ("[XFRM]: Extension for dynamic update of endpoint address(es)")
      Reported-by: default avatar"bo Zhang" <zhangbo5891001@gmail.com>
      Signed-off-by: default avatarVladis Dronov <vdronov@redhat.com>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      60166dc9
    • Arend van Spriel's avatar
      brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx() · c63048a2
      Arend van Spriel authored
      commit 8f44c9a4 upstream.
      
      The lower level nl80211 code in cfg80211 ensures that "len" is between
      25 and NL80211_ATTR_FRAME (2304).  We subtract DOT11_MGMT_HDR_LEN (24) from
      "len" so thats's max of 2280.  However, the action_frame->data[] buffer is
      only BRCMF_FIL_ACTION_FRAME_SIZE (1800) bytes long so this memcpy() can
      overflow.
      
      	memcpy(action_frame->data, &buf[DOT11_MGMT_HDR_LEN],
      	       le16_to_cpu(action_frame->len));
      
      Fixes: 18e2f61d ("brcmfmac: P2P action frame tx.")
      Reported-by: default avatar"freenerguo(郭大兴)" <freenerguo@tencent.com>
      Signed-off-by: default avatarArend van Spriel <arend.vanspriel@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      c63048a2
    • Jann Horn's avatar
      ptrace: use fsuid, fsgid, effective creds for fs access checks · 229aba44
      Jann Horn authored
      commit caaee623 upstream.
      
      By checking the effective credentials instead of the real UID / permitted
      capabilities, ensure that the calling process actually intended to use its
      credentials.
      
      To ensure that all ptrace checks use the correct caller credentials (e.g.
      in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
      flag), use two new flags and require one of them to be set.
      
      The problem was that when a privileged task had temporarily dropped its
      privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
      perform following syscalls with the credentials of a user, it still passed
      ptrace access checks that the user would not be able to pass.
      
      While an attacker should not be able to convince the privileged task to
      perform a ptrace() syscall, this is a problem because the ptrace access
      check is reused for things in procfs.
      
      In particular, the following somewhat interesting procfs entries only rely
      on ptrace access checks:
      
       /proc/$pid/stat - uses the check for determining whether pointers
           should be visible, useful for bypassing ASLR
       /proc/$pid/maps - also useful for bypassing ASLR
       /proc/$pid/cwd - useful for gaining access to restricted
           directories that contain files with lax permissions, e.g. in
           this scenario:
           lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
           drwx------ root root /root
           drwxr-xr-x root root /root/foobar
           -rw-r--r-- root root /root/foobar/secret
      
      Therefore, on a system where a root-owned mode 6755 binary changes its
      effective credentials as described and then dumps a user-specified file,
      this could be used by an attacker to reveal the memory layout of root's
      processes or reveal the contents of files he is not allowed to access
      (through /proc/$pid/cwd).
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.16:
       - Update mm_access() calls in fs/proc/task_{,no}mmu.c too
       - Adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      229aba44
    • Sabrina Dubroca's avatar
      tracing/kprobes: Allow to create probe with a module name starting with a digit · 66252445
      Sabrina Dubroca authored
      commit 9e52b325 upstream.
      
      Always try to parse an address, since kstrtoul() will safely fail when
      given a symbol as input. If that fails (which will be the case for a
      symbol), try to parse a symbol instead.
      
      This allows creating a probe such as:
      
          p:probe/vlan_gro_receive 8021q:vlan_gro_receive+0
      
      Which is necessary for this command to work:
      
          perf probe -m 8021q -a vlan_gro_receive
      
      Link: http://lkml.kernel.org/r/fd72d666f45b114e2c5b9cf7e27b91de1ec966f1.1498122881.git.sd@queasysnail.net
      
      Fixes: 413d37d1 ("tracing: Add kprobe-based event tracer")
      Acked-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      [bwh: Backported to 3.16: preserve the check that an addresses isn't used for
       a kretprobe]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      66252445
    • James Hogan's avatar
      MIPS: Avoid accidental raw backtrace · 86c2e149
      James Hogan authored
      commit 85423636 upstream.
      
      Since commit 81a76d71 ("MIPS: Avoid using unwind_stack() with
      usermode") show_backtrace() invokes the raw backtracer when
      cp0_status & ST0_KSU indicates user mode to fix issues on EVA kernels
      where user and kernel address spaces overlap.
      
      However this is used by show_stack() which creates its own pt_regs on
      the stack and leaves cp0_status uninitialised in most of the code paths.
      This results in the non deterministic use of the raw back tracer
      depending on the previous stack content.
      
      show_stack() deals exclusively with kernel mode stacks anyway, so
      explicitly initialise regs.cp0_status to KSU_KERNEL (i.e. 0) to ensure
      we get a useful backtrace.
      
      Fixes: 81a76d71 ("MIPS: Avoid using unwind_stack() with usermode")
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/16656/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      86c2e149
    • Paul Burton's avatar
      MIPS: Fix IRQ tracing & lockdep when rescheduling · ef1778b8
      Paul Burton authored
      commit d8550860 upstream.
      
      When the scheduler sets TIF_NEED_RESCHED & we call into the scheduler
      from arch/mips/kernel/entry.S we disable interrupts. This is true
      regardless of whether we reach work_resched from syscall_exit_work,
      resume_userspace or by looping after calling schedule(). Although we
      disable interrupts in these paths we don't call trace_hardirqs_off()
      before calling into C code which may acquire locks, and we therefore
      leave lockdep with an inconsistent view of whether interrupts are
      disabled or not when CONFIG_PROVE_LOCKING & CONFIG_DEBUG_LOCKDEP are
      both enabled.
      
      Without tracing this interrupt state lockdep will print warnings such
      as the following once a task returns from a syscall via
      syscall_exit_partial with TIF_NEED_RESCHED set:
      
      [   49.927678] ------------[ cut here ]------------
      [   49.934445] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3687 check_flags.part.41+0x1dc/0x1e8
      [   49.946031] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      [   49.946355] CPU: 0 PID: 1 Comm: init Not tainted 4.10.0-00439-gc9fd5d362289-dirty #197
      [   49.963505] Stack : 0000000000000000 ffffffff81bb5d6a 0000000000000006 ffffffff801ce9c4
      [   49.974431]         0000000000000000 0000000000000000 0000000000000000 000000000000004a
      [   49.985300]         ffffffff80b7e487 ffffffff80a24498 a8000000ff160000 ffffffff80ede8b8
      [   49.996194]         0000000000000001 0000000000000000 0000000000000000 0000000077c8030c
      [   50.007063]         000000007fd8a510 ffffffff801cd45c 0000000000000000 a8000000ff127c88
      [   50.017945]         0000000000000000 ffffffff801cf928 0000000000000001 ffffffff80a24498
      [   50.028827]         0000000000000000 0000000000000001 0000000000000000 0000000000000000
      [   50.039688]         0000000000000000 a8000000ff127bd0 0000000000000000 ffffffff805509bc
      [   50.050575]         00000000140084e0 0000000000000000 0000000000000000 0000000000040a00
      [   50.061448]         0000000000000000 ffffffff8010e1b0 0000000000000000 ffffffff805509bc
      [   50.072327]         ...
      [   50.076087] Call Trace:
      [   50.079869] [<ffffffff8010e1b0>] show_stack+0x80/0xa8
      [   50.086577] [<ffffffff805509bc>] dump_stack+0x10c/0x190
      [   50.093498] [<ffffffff8015dde0>] __warn+0xf0/0x108
      [   50.099889] [<ffffffff8015de34>] warn_slowpath_fmt+0x3c/0x48
      [   50.107241] [<ffffffff801c15b4>] check_flags.part.41+0x1dc/0x1e8
      [   50.114961] [<ffffffff801c239c>] lock_is_held_type+0x8c/0xb0
      [   50.122291] [<ffffffff809461b8>] __schedule+0x8c0/0x10f8
      [   50.129221] [<ffffffff80946a60>] schedule+0x30/0x98
      [   50.135659] [<ffffffff80106278>] work_resched+0x8/0x34
      [   50.142397] ---[ end trace 0cb4f6ef5b99fe21 ]---
      [   50.148405] possible reason: unannotated irqs-off.
      [   50.154600] irq event stamp: 400463
      [   50.159566] hardirqs last  enabled at (400463): [<ffffffff8094edc8>] _raw_spin_unlock_irqrestore+0x40/0xa8
      [   50.171981] hardirqs last disabled at (400462): [<ffffffff8094eb98>] _raw_spin_lock_irqsave+0x30/0xb0
      [   50.183897] softirqs last  enabled at (400450): [<ffffffff8016580c>] __do_softirq+0x4ac/0x6a8
      [   50.195015] softirqs last disabled at (400425): [<ffffffff80165e78>] irq_exit+0x110/0x128
      
      Fix this by using the TRACE_IRQS_OFF macro to call trace_hardirqs_off()
      when CONFIG_TRACE_IRQFLAGS is enabled. This is done before invoking
      schedule() following the work_resched label because:
      
       1) Interrupts are disabled regardless of the path we take to reach
          work_resched() & schedule().
      
       2) Performing the tracing here avoids the need to do it in paths which
          disable interrupts but don't call out to C code before hitting a
          path which uses the RESTORE_SOME macro that will call
          trace_hardirqs_on() or trace_hardirqs_off() as appropriate.
      
      We call trace_hardirqs_on() using the TRACE_IRQS_ON macro before calling
      syscall_trace_leave() for similar reasons, ensuring that lockdep has a
      consistent view of state after we re-enable interrupts.
      Signed-off-by: default avatarPaul Burton <paul.burton@imgtec.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/15385/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      ef1778b8
    • Paul Burton's avatar
      MIPS: pm-cps: Drop manual cache-line alignment of ready_count · 0c5bda34
      Paul Burton authored
      commit 161c51cc upstream.
      
      We allocate memory for a ready_count variable per-CPU, which is accessed
      via a cached non-coherent TLB mapping to perform synchronisation between
      threads within the core using LL/SC instructions. In order to ensure
      that the variable is contained within its own data cache line we
      allocate 2 lines worth of memory & align the resulting pointer to a line
      boundary. This is however unnecessary, since kmalloc is guaranteed to
      return memory which is at least cache-line aligned (see
      ARCH_DMA_MINALIGN). Stop the redundant manual alignment.
      
      Besides cleaning up the code & avoiding needless work, this has the side
      effect of avoiding an arithmetic error found by Bryan on 64 bit systems
      due to the 32 bit size of the former dlinesz. This led the ready_count
      variable to have its upper 32b cleared erroneously for MIPS64 kernels,
      causing problems when ready_count was later used on MIPS64 via cpuidle.
      Signed-off-by: default avatarPaul Burton <paul.burton@imgtec.com>
      Fixes: 3179d37e ("MIPS: pm-cps: add PM state entry code for CPS systems")
      Reported-by: default avatarBryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Reviewed-by: default avatarBryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Tested-by: default avatarBryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/15383/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      0c5bda34
    • Doug Berger's avatar
      ARM: 8685/1: ensure memblock-limit is pmd-aligned · 603f239b
      Doug Berger authored
      commit 9e25ebfe upstream.
      
      The pmd containing memblock_limit is cleared by prepare_page_table()
      which creates the opportunity for early_alloc() to allocate unmapped
      memory if memblock_limit is not pmd aligned causing a boot-time hang.
      
      Commit 965278dc ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
      attempted to resolve this problem, but there is a path through the
      adjust_lowmem_bounds() routine where if all memory regions start and
      end on pmd-aligned addresses the memblock_limit will be set to
      arm_lowmem_limit.
      
      Since arm_lowmem_limit can be affected by the vmalloc early parameter,
      the value of arm_lowmem_limit may not be pmd-aligned. This commit
      corrects this oversight such that memblock_limit is always rounded
      down to pmd-alignment.
      
      Fixes: 965278dc ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
      Signed-off-by: default avatarDoug Berger <opendmb@gmail.com>
      Suggested-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      603f239b
    • Michal Kubeček's avatar
      net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() · 96f30d3e
      Michal Kubeček authored
      commit e44699d2 upstream.
      
      Recently I started seeing warnings about pages with refcount -1. The
      problem was traced to packets being reused after their head was merged into
      a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
      commit c21b48cc ("net: adjust skb->truesize in ___pskb_trim()") and
      I have never seen it on a kernel with it reverted, I believe the real
      problem appeared earlier when the option to merge head frag in GRO was
      implemented.
      
      Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
      branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
      and head is merged (which in my case happens after the skb_condense()
      call added by the commit mentioned above), the skb is reused including the
      head that has been merged. As a result, we release the page reference
      twice and eventually end up with negative page refcount.
      
      To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
      the same way it's done in napi_skb_finish().
      
      Fixes: d7e8883c ("net: make GRO aware of skb->head_frag")
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: The necessary cleanup is just kmem_cache_free(),
       so don't bother adding a function for this.]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      96f30d3e
    • Hui Wang's avatar
      ALSA: hda - set input_path bitmap to zero after moving it to new place · 5a74f081
      Hui Wang authored
      commit a8f20fd2 upstream.
      
      Recently we met a problem, the codec has valid adcs and input pins,
      and they can form valid input paths, but the driver does not build
      valid controls for them like "Mic boost", "Capture Volume" and
      "Capture Switch".
      
      Through debugging, I found the driver needs to shrink the invalid
      adcs and input paths for this machine, so it will move the whole
      column bitmap value to the previous column, after moving it, the
      driver forgets to set the original column bitmap value to zero, as a
      result, the driver will invalidate the path whose index value is the
      original colume bitmap value. After executing this function, all
      valid input paths are invalidated by a mistake, there are no any
      valid input paths, so the driver won't build controls for them.
      
      Fixes: 3a65bcdc ("ALSA: hda - Fix inconsistent input_paths after ADC reduction")
      Signed-off-by: default avatarHui Wang <hui.wang@canonical.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      5a74f081
    • Eric Dumazet's avatar
      net: prevent sign extension in dev_get_stats() · 6956b55d
      Eric Dumazet authored
      commit 6f64ec74 upstream.
      
      Similar to the fix provided by Dominik Heidler in commit
      9b3dc0a1 ("l2tp: cast l2tp traffic counter to unsigned")
      we need to take care of 32bit kernels in dev_get_stats().
      
      When using atomic_long_read(), we add a 'long' to u64 and
      might misinterpret high order bit, unless we cast to unsigned.
      
      Fixes: caf586e5 ("net: add a core netdev->rx_dropped counter")
      Fixes: 015f0688 ("net: net: add a core netdev->tx_dropped counter")
      Fixes: 6e7333d3 ("net: add rx_nohandler stat counter")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: only {rx,tx}_dropped are updated here]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      6956b55d
    • WANG Cong's avatar
      tcp: reset sk_rx_dst in tcp_disconnect() · 58347335
      WANG Cong authored
      commit d747a7a5 upstream.
      
      We have to reset the sk->sk_rx_dst when we disconnect a TCP
      connection, because otherwise when we re-connect it this
      dst reference is simply overridden in tcp_finish_connect().
      
      This fixes a dst leak which leads to a loopback dev refcnt
      leak. It is a long-standing bug, Kevin reported a very similar
      (if not same) bug before. Thanks to Andrei for providing such
      a reliable reproducer which greatly narrows down the problem.
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Reported-by: default avatarAndrei Vagin <avagin@gmail.com>
      Reported-by: default avatarKevin Xu <kaiwen.xu@hulu.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      58347335
    • Ilya Matveychikov's avatar
      lib/cmdline.c: fix get_options() overflow while parsing ranges · 59d9b0eb
      Ilya Matveychikov authored
      commit a91e0f68 upstream.
      
      When using get_options() it's possible to specify a range of numbers,
      like 1-100500.  The problem is that it doesn't track array size while
      calling internally to get_range() which iterates over the range and
      fills the memory with numbers.
      
      Link: http://lkml.kernel.org/r/2613C75C-B04D-4BFF-82A6-12F97BA0F620@gmail.comSigned-off-by: default avatarIlya V. Matveychikov <matvejchikov@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      59d9b0eb
    • NeilBrown's avatar
      autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL · 6a09fc11
      NeilBrown authored
      commit 9fa4eb8e upstream.
      
      If a positive status is passed with the AUTOFS_DEV_IOCTL_FAIL ioctl,
      autofs4_d_automount() will return
      
         ERR_PTR(status)
      
      with that status to follow_automount(), which will then dereference an
      invalid pointer.
      
      So treat a positive status the same as zero, and map to ENOENT.
      
      See comment in systemd src/core/automount.c::automount_send_ready().
      
      Link: http://lkml.kernel.org/r/871sqwczx5.fsf@notabene.neil.brown.nameSigned-off-by: default avatarNeilBrown <neilb@suse.com>
      Cc: Ian Kent <raven@themaw.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      6a09fc11
    • Richard Cochran's avatar
      net: dp83640: Avoid NULL pointer dereference. · f4e204d8
      Richard Cochran authored
      commit db9d8b29 upstream.
      
      The function, skb_complete_tx_timestamp(), used to allow passing in a
      NULL pointer for the time stamps, but that was changed in commit
      62bccb8c ("net-timestamp: Make the
      clone operation stand-alone from phy timestamping"), and the existing
      call sites, all of which are in the dp83640 driver, were fixed up.
      
      Even though the kernel-doc was subsequently updated in commit
      7a76a021 ("net-timestamp: Update
      skb_complete_tx_timestamp comment"), still a bug fix from Manfred
      Rudigier came into the driver using the old semantics.  Probably
      Manfred derived that patch from an older kernel version.
      
      This fix should be applied to the stable trees as well.
      
      Fixes: 81e8f2e9 ("net: dp83640: Fix tx timestamp overflow handling.")
      Signed-off-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f4e204d8
    • Michal Kubeček's avatar
      net: account for current skb length when deciding about UFO · 72a2cad7
      Michal Kubeček authored
      commit a5cb659b upstream.
      
      Our customer encountered stuck NFS writes for blocks starting at specific
      offsets w.r.t. page boundary caused by networking stack sending packets via
      UFO enabled device with wrong checksum. The problem can be reproduced by
      composing a long UDP datagram from multiple parts using MSG_MORE flag:
      
        sendto(sd, buff, 1000, MSG_MORE, ...);
        sendto(sd, buff, 1000, MSG_MORE, ...);
        sendto(sd, buff, 3000, 0, ...);
      
      Assume this packet is to be routed via a device with MTU 1500 and
      NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
      this condition is tested (among others) to decide whether to call
      ip_ufo_append_data():
      
        ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
      
      At the moment, we already have skb with 1028 bytes of data which is not
      marked for GSO so that the test is false (fragheaderlen is usually 20).
      Thus we append second 1000 bytes to this skb without invoking UFO. Third
      sendto(), however, has sufficient length to trigger the UFO path so that we
      end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
      uses udp_csum() to calculate the checksum but that assumes all fragments
      have correct checksum in skb->csum which is not true for UFO fragments.
      
      When checking against MTU, we need to add skb->len to length of new segment
      if we already have a partially filled skb and fragheaderlen only if there
      isn't one.
      
      In the IPv6 case, skb can only be null if this is the first segment so that
      we have to use headersize (length of the first IPv6 header) rather than
      fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
      
      Fixes: e89e9cf5 ("[IPv4/IPv6]: UFO Scatter-gather approach")
      Fixes: e4c5e13a ("ipv6: Should use consistent conditional judgement for
      	ip6 fragment between __ip6_append_data and ip6_finish_output")
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Acked-by: default avatarVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16:
       - Move headersize out of the if-statement in ip6_append_data() so it can be
         used for this
       - Adjust context to apply after "udp: consistently apply ufo or fragmentation"]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      72a2cad7
    • zheng li's avatar
      ipv4: Should use consistent conditional judgement for ip fragment in... · c52f61de
      zheng li authored
      ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
      
      commit 0a28cfd5 upstream.
      
      There is an inconsistent conditional judgement in __ip_append_data and
      ip_finish_output functions, the variable length in __ip_append_data just
      include the length of application's payload and udp header, don't include
      the length of ip header, but in ip_finish_output use
      (skb->len > ip_skb_dst_mtu(skb)) as judgement, and skb->len include the
      length of ip header.
      
      That causes some particular application's udp payload whose length is
      between (MTU - IP Header) and MTU were fragmented by ip_fragment even
      though the rst->dev support UFO feature.
      
      Add the length of ip header to length in __ip_append_data to keep
      consistent conditional judgement as ip_finish_output for ip fragment.
      Signed-off-by: default avatarZheng Li <james.z.li@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: adjust context to apply after "udp: consistently apply
       ufo or fragmentation"]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      c52f61de
    • Nicholas Piggin's avatar
      powerpc/64: Initialise thread_info for emergency stacks · cb2f131f
      Nicholas Piggin authored
      commit 34f19ff1 upstream.
      
      Emergency stacks have their thread_info mostly uninitialised, which in
      particular means garbage preempt_count values.
      
      Emergency stack code runs with interrupts disabled entirely, and is
      used very rarely, so this has been unnoticed so far. It was found by a
      proposed new powerpc watchdog that takes a soft-NMI directly from the
      masked_interrupt handler and using the emergency stack. That crashed
      at BUG_ON(in_nmi()) in nmi_enter(). preempt_count()s were found to be
      garbage.
      
      To fix this, zero the entire THREAD_SIZE allocation, and initialize
      the thread_info.
      Reported-by: default avatarAbdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      [mpe: Move it all into setup_64.c, use a function not a macro. Fix
            crashes on Cell by setting preempt_count to 0 not HARDIRQ_OFFSET]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      [bwh: Backported to 3.16:
       - There are only two emergency stacks
       - No need to call klp_init_thread_info()
       - Add the ti variable in emergency_stack_init()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      cb2f131f
    • WANG Cong's avatar
      ipv6: avoid unregistering inet6_dev for loopback · b7aba372
      WANG Cong authored
      commit 60abc0be upstream.
      
      The per netns loopback_dev->ip6_ptr is unregistered and set to
      NULL when its mtu is set to smaller than IPV6_MIN_MTU, this
      leads to that we could set rt->rt6i_idev NULL after a
      rt6_uncached_list_flush_dev() and then crash after another
      call.
      
      In this case we should just bring its inet6_dev down, rather
      than unregistering it, at least prior to commit 176c39af
      ("netns: fix addrconf_ifdown kernel panic") we always
      override the case for loopback.
      
      Thanks a lot to Andrey for finding a reliable reproducer.
      
      Fixes: 176c39af ("netns: fix addrconf_ifdown kernel panic")
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: the NETDEV_CHANGEMTU case used to fall-through to the
       NETDEV_DOWN case here, so replace that with a separate call to addrconf_ifdown()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b7aba372
    • WANG Cong's avatar
      ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER · 29530235
      WANG Cong authored
      commit 76da0704 upstream.
      
      In commit 242d3a49 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
      I assumed NETDEV_REGISTER and NETDEV_UNREGISTER are paired,
      unfortunately, as reported by jeffy, netdev_wait_allrefs()
      could rebroadcast NETDEV_UNREGISTER event until all refs are
      gone.
      
      We have to add an additional check to avoid this corner case.
      For netdev_wait_allrefs() dev->reg_state is NETREG_UNREGISTERED,
      for dev_change_net_namespace(), dev->reg_state is
      NETREG_REGISTERED. So check for dev->reg_state != NETREG_UNREGISTERED.
      
      Fixes: 242d3a49 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
      Reported-by: default avatarjeffy <jeffy.chen@rock-chips.com>
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      29530235
    • WANG Cong's avatar
      ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf · 10b096d5
      WANG Cong authored
      commit 242d3a49 upstream.
      
      For each netns (except init_net), we initialize its null entry
      in 3 places:
      
      1) The template itself, as we use kmemdup()
      2) Code around dst_init_metrics() in ip6_route_net_init()
      3) ip6_route_dev_notify(), which is supposed to initialize it after
         loopback registers
      
      Unfortunately the last one still happens in a wrong order because
      we expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to
      net->loopback_dev's idev, thus we have to do that after we add
      idev to loopback. However, this notifier has priority == 0 same as
      ipv6_dev_notf, and ipv6_dev_notf is registered after
      ip6_route_dev_notifier so it is called actually after
      ip6_route_dev_notifier. This is similar to commit 2f460933
      ("ipv6: initialize route null entry in addrconf_init()") which
      fixes init_net.
      
      Fix it by picking a smaller priority for ip6_route_dev_notifier.
      Also, we have to release the refcnt accordingly when unregistering
      loopback_dev because device exit functions are called before subsys
      exit functions.
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Tested-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      10b096d5
    • WANG Cong's avatar
      ipv6: initialize route null entry in addrconf_init() · 9cd815ca
      WANG Cong authored
      commit 2f460933 upstream.
      
      Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
      since it is always NULL.
      
      This is clearly wrong, we have code to initialize it to loopback_dev,
      unfortunately the order is still not correct.
      
      loopback_dev is registered very early during boot, we lose a chance
      to re-initialize it in notifier. addrconf_init() is called after
      ip6_route_init(), which means we have no chance to correct it.
      
      Fix it by moving this initialization explicitly after
      ipv6_add_dev(init_net.loopback_dev) in addrconf_init().
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9cd815ca
    • Michail Georgios Etairidis's avatar
      i2c: imx: Use correct function to write to register · 1042c147
      Michail Georgios Etairidis authored
      commit 6c782a5e upstream.
      
      The i2c-imx driver incorrectly uses readb()/writeb() to read and
      write to the appropriate registers when performing a repeated start.
      The appropriate imx_i2c_read_reg()/imx_i2c_write_reg() functions
      should be used instead. Performing a repeated start results in
      a kernel panic. The platform is imx.
      Signed-off-by: default avatarMichail G Etairidis <m.etairidis@beck-ipc.com>
      Fixes: ce1a7884 ("i2c: imx: add DMA support for freescale i2c driver")
      Fixes: 054b62d9 ("i2c: imx: fix the i2c bus hang issue when do repeat restart")
      Acked-by: default avatarFugang Duan <fugang.duan@nxp.com>
      Acked-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: default avatarWolfram Sang <wsa@the-dreams.de>
      [bwh: Backported to 3.16: drop changes in i2c_imx_dma_read()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1042c147
    • Pavel Shilovsky's avatar
      CIFS: Improve readdir verbosity · 6ed510ce
      Pavel Shilovsky authored
      commit dcd87838 upstream.
      
      Downgrade the loglevel for SMB2 to prevent filling the log
      with messages if e.g. readdir was interrupted. Also make SMB2
      and SMB1 codepaths do the same logging during readdir.
      Signed-off-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      6ed510ce
    • Serhey Popovych's avatar
      rtnetlink: add IFLA_GROUP to ifla_policy · 33b4a77d
      Serhey Popovych authored
      commit db833d40 upstream.
      
      Network interface groups support added while ago, however
      there is no IFLA_GROUP attribute description in policy
      and netlink message size calculations until now.
      
      Add IFLA_GROUP attribute to the policy.
      
      Fixes: cbda10fa ("net_device: add support for network device groups")
      Signed-off-by: default avatarSerhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      33b4a77d
    • Serhey Popovych's avatar
      ipv6: Do not leak throw route references · fff429ef
      Serhey Popovych authored
      commit 07f61557 upstream.
      
      While commit 73ba57bf ("ipv6: fix backtracking for throw routes")
      does good job on error propagation to the fib_rules_lookup()
      in fib rules core framework that also corrects throw routes
      handling, it does not solve route reference leakage problem
      happened when we return -EAGAIN to the fib_rules_lookup()
      and leave routing table entry referenced in arg->result.
      
      If rule with matched throw route isn't last matched in the
      list we overwrite arg->result losing reference on throw
      route stored previously forever.
      
      We also partially revert commit ab997ad4 ("ipv6: fix the
      incorrect return value of throw route") since we never return
      routing table entry with dst.error == -EAGAIN when
      CONFIG_IPV6_MULTIPLE_TABLES is on. Also there is no point
      to check for RTF_REJECT flag since it is always set throw
      route.
      
      Fixes: 73ba57bf ("ipv6: fix backtracking for throw routes")
      Signed-off-by: default avatarSerhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.16: commit ab997ad4 was never applied here and does
       not need to be reverted]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      fff429ef
    • Alex Deucher's avatar
      7b0c5298
    • Alex Deucher's avatar
      5f2c2b24
    • Daniel Drake's avatar
      Input: i8042 - add Fujitsu Lifebook AH544 to notimeout list · 01798710
      Daniel Drake authored
      commit 817ae460 upstream.
      
      Without this quirk, the touchpad is not responsive on this product, with
      the following message repeated in the logs:
      
       psmouse serio1: bad data from KBC - timeout
      
      Add it to the notimeout list alongside other similar Fujitsu laptops.
      Signed-off-by: default avatarDaniel Drake <drake@endlessm.com>
      Signed-off-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      01798710
    • Eric W. Biederman's avatar
      signal: Only reschedule timers on signals timers have sent · f3bcee2c
      Eric W. Biederman authored
      commit 57db7e4a upstream.
      
      Thomas Gleixner  wrote:
      > The CRIU support added a 'feature' which allows a user space task to send
      > arbitrary (kernel) signals to itself. The changelog says:
      >
      >   The kernel prevents sending of siginfo with positive si_code, because
      >   these codes are reserved for kernel.  I think we can allow a task to
      >   send such a siginfo to itself.  This operation should not be dangerous.
      >
      > Quite contrary to that claim, it turns out that it is outright dangerous
      > for signals with info->si_code == SI_TIMER. The following code sequence in
      > a user space task allows to crash the kernel:
      >
      >    id = timer_create(CLOCK_XXX, ..... signo = SIGX);
      >    timer_set(id, ....);
      >    info->si_signo = SIGX;
      >    info->si_code = SI_TIMER:
      >    info->_sifields._timer._tid = id;
      >    info->_sifields._timer._sys_private = 2;
      >    rt_[tg]sigqueueinfo(..., SIGX, info);
      >    sigemptyset(&sigset);
      >    sigaddset(&sigset, SIGX);
      >    rt_sigtimedwait(sigset, info);
      >
      > For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
      > results in a kernel crash because sigwait() dequeues the signal and the
      > dequeue code observes:
      >
      >   info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
      >
      > which triggers the following callchain:
      >
      >  do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
      >
      > arm_timer() executes a list_add() on the timer, which is already armed via
      > the timer_set() syscall. That's a double list add which corrupts the posix
      > cpu timer list. As a consequence the kernel crashes on the next operation
      > touching the posix cpu timer list.
      >
      > Posix clocks which are internally implemented based on hrtimers are not
      > affected by this because hrtimer_start() can handle already armed timers
      > nicely, but it's a reliable way to trigger the WARN_ON() in
      > hrtimer_forward(), which complains about calling that function on an
      > already armed timer.
      
      This problem has existed since the posix timer code was merged into
      2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to
      inject not just a signal (which linux has supported since 1.0) but the
      full siginfo of a signal.
      
      The core problem is that the code will reschedule in response to
      signals getting dequeued not just for signals the timers sent but
      for other signals that happen to a si_code of SI_TIMER.
      
      Avoid this confusion by testing to see if the queued signal was
      preallocated as all timer signals are preallocated, and so far
      only the timer code preallocates signals.
      
      Move the check for if a timer needs to be rescheduled up into
      collect_signal where the preallocation check must be performed,
      and pass the result back to dequeue_signal where the code reschedules
      timers.   This makes it clear why the code cares about preallocated
      timers.
      Reported-by: default avatarThomas Gleixner <tglx@linutronix.de>
      History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Reference: 66dd34ad ("signal: allow to send any siginfo to itself")
      Reference: 1669ce53 ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO")
      Fixes: db8b50ba ("[PATCH] POSIX clocks & timers")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f3bcee2c
    • Mark Rutland's avatar
      mm: numa: avoid waiting on freed migrated pages · 5c2ca2d4
      Mark Rutland authored
      commit 3c226c63 upstream.
      
      In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
      waiting until the pmd is unlocked before we return and retry.  However,
      we can race with migrate_misplaced_transhuge_page():
      
          // do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
          // Holds 0 refs on page                 // Holds 2 refs on page
      
          vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
          /* ... */
          if (pmd_trans_migrating(*vmf->pmd)) {
                  page = pmd_page(*vmf->pmd);
                  spin_unlock(vmf->ptl);
                                                  ptl = pmd_lock(mm, pmd);
                                                  if (page_count(page) != 2)) {
                                                          /* roll back */
                                                  }
                                                  /* ... */
                                                  mlock_migrate_page(new_page, page);
                                                  /* ... */
                                                  spin_unlock(ptl);
                                                  put_page(page);
                                                  put_page(page); // page freed here
                  wait_on_page_locked(page);
                  goto out;
          }
      
      This can result in the freed page having its waiters flag set
      unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
      page alloc/free functions.  This has been observed on arm64 KVM guests.
      
      We can avoid this by having do_huge_pmd_numa_page() take a reference on
      the page before dropping the pmd lock, mirroring what we do in
      __migration_entry_wait().
      
      When we hit the race, migrate_misplaced_transhuge_page() will see the
      reference and abort the migration, as it may do today in other cases.
      
      Fixes: b8916634 ("mm: Prevent parallel splits during THP migration")
      Link: http://lkml.kernel.org/r/1497349722-6731-2-git-send-email-will.deacon@arm.comSigned-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      5c2ca2d4
    • Yu Zhao's avatar
      swap: cond_resched in swap_cgroup_prepare() · 0838b5cc
      Yu Zhao authored
      commit ef707629 upstream.
      
      I saw need_resched() warnings when swapping on large swapfile (TBs)
      because continuously allocating many pages in swap_cgroup_prepare() took
      too long.
      
      We already cond_resched when freeing page in swap_cgroup_swapoff().  Do
      the same for the page allocation.
      
      Link: http://lkml.kernel.org/r/20170604200109.17606-1-yuzhao@google.comSigned-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.16: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      0838b5cc
    • James Morse's avatar
      mm/memory-failure.c: use compound_head() flags for huge pages · 983bd66d
      James Morse authored
      commit 7258ae5c upstream.
      
      memory_failure() chooses a recovery action function based on the page
      flags.  For huge pages it uses the tail page flags which don't have
      anything interesting set, resulting in:
      
      > Memory failure: 0x9be3b4: Unknown page state
      > Memory failure: 0x9be3b4: recovery action for unknown page: Failed
      
      Instead, save a copy of the head page's flags if this is a huge page,
      this means if there are no relevant flags for this tail page, we use the
      head pages flags instead.  This results in the me_huge_page() recovery
      action being called:
      
      > Memory failure: 0x9b7969: recovery action for huge page: Delayed
      
      For hugepages that have not yet been allocated, this allows the hugepage
      to be dequeued.
      
      Fixes: 524fca1e ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")
      Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.comSigned-off-by: default avatarJames Morse <james.morse@arm.com>
      Tested-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      983bd66d
    • Naveen N. Rao's avatar
      powerpc/kprobes: Pause function_graph tracing during jprobes handling · f48449f0
      Naveen N. Rao authored
      commit a9f8553e upstream.
      
      This fixes a crash when function_graph and jprobes are used together.
      This is essentially commit 237d28db ("ftrace/jprobes/x86: Fix
      conflict between jprobes and function graph tracing"), but for powerpc.
      
      Jprobes breaks function_graph tracing since the jprobe hook needs to use
      jprobe_return(), which never returns back to the hook, but instead to
      the original jprobe'd function. The solution is to momentarily pause
      function_graph tracing before invoking the jprobe hook and re-enable it
      when returning back to the original jprobe'd function.
      
      Fixes: 6794c782 ("powerpc64: port of the function graph tracer")
      Signed-off-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Acked-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f48449f0
    • Liwei Song's avatar
      i2c: ismt: fix wrong device address when unmap the data buffer · 576c9404
      Liwei Song authored
      commit 17e83549 upstream.
      
      Fix the following kernel bug:
      
      kernel BUG at drivers/iommu/intel-iommu.c:3260!
      invalid opcode: 0000 [#5] PREEMPT SMP
      Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB0.X64.0013.D39.1608311820 08/31/2016
      task: ffff880175389950 ti: ffff880176bec000 task.ti: ffff880176bec000
      RIP: 0010:[<ffffffff8150a83b>]  [<ffffffff8150a83b>] intel_unmap+0x25b/0x260
      RSP: 0018:ffff880176bef5e8  EFLAGS: 00010296
      RAX: 0000000000000024 RBX: ffff8800773c7c88 RCX: 000000000000ce04
      RDX: 0000000080000000 RSI: 0000000000000000 RDI: 0000000000000009
      RBP: ffff880176bef638 R08: 0000000000000010 R09: 0000000000000004
      R10: ffff880175389c78 R11: 0000000000000a4f R12: ffff8800773c7868
      R13: 00000000ffffac88 R14: ffff8800773c7818 R15: 0000000000000001
      FS:  00007fef21258700(0000) GS:ffff88017b5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000066d6d8 CR3: 000000007118c000 CR4: 00000000003406e0
      Stack:
       00000000ffffac88 ffffffff8199867f ffff880176bef5f8 ffff880100000030
       ffff880176bef668 ffff8800773c7c88 ffff880178288098 ffff8800772c0010
       ffff8800773c7818 0000000000000001 ffff880176bef648 ffffffff8150a86e
      Call Trace:
       [<ffffffff8199867f>] ? printk+0x46/0x48
       [<ffffffff8150a86e>] intel_unmap_page+0xe/0x10
       [<ffffffffa039d99b>] ismt_access+0x27b/0x8fa [i2c_ismt]
       [<ffffffff81554420>] ? __pm_runtime_suspend+0xa0/0xa0
       [<ffffffff815544a0>] ? pm_suspend_timer_fn+0x80/0x80
       [<ffffffff81554420>] ? __pm_runtime_suspend+0xa0/0xa0
       [<ffffffff815544a0>] ? pm_suspend_timer_fn+0x80/0x80
       [<ffffffff8143dfd0>] ? pci_bus_read_dev_vendor_id+0xf0/0xf0
       [<ffffffff8172b36c>] i2c_smbus_xfer+0xec/0x4b0
       [<ffffffff810aa4d5>] ? vprintk_emit+0x345/0x530
       [<ffffffffa038936b>] i2cdev_ioctl_smbus+0x12b/0x240 [i2c_dev]
       [<ffffffff810aa829>] ? vprintk_default+0x29/0x40
       [<ffffffffa0389b33>] i2cdev_ioctl+0x63/0x1ec [i2c_dev]
       [<ffffffff811b04c8>] do_vfs_ioctl+0x328/0x5d0
       [<ffffffff8119d8ec>] ? vfs_write+0x11c/0x190
       [<ffffffff8109d449>] ? rt_up_read+0x19/0x20
       [<ffffffff811b07f1>] SyS_ioctl+0x81/0xa0
       [<ffffffff819a351b>] system_call_fastpath+0x16/0x6e
      
      This happen When run "i2cdetect -y 0" detect SMBus iSMT adapter.
      
      After finished I2C block read/write, when unmap the data buffer,
      a wrong device address was pass to dma_unmap_single().
      
      To fix this, give dma_unmap_single() the "dev" parameter, just like
      what dma_map_single() does, then unmap can find the right devices.
      
      Fixes: 13f35ac1 ("i2c: Adding support for Intel iSMT SMBus 2.0 host controller")
      Signed-off-by: default avatarLiwei Song <liwei.song@windriver.com>
      Reviewed-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: default avatarWolfram Sang <wsa@the-dreams.de>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      576c9404
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Preserve userspace HTM state properly · 1433c019
      Paul Mackerras authored
      commit 46a704f8 upstream.
      
      If userspace attempts to call the KVM_RUN ioctl when it has hardware
      transactional memory (HTM) enabled, the values that it has put in the
      HTM-related SPRs TFHAR, TFIAR and TEXASR will get overwritten by
      guest values.  To fix this, we detect this condition and save those
      SPR values in the thread struct, and disable HTM for the task.  If
      userspace goes to access those SPRs or the HTM facility in future,
      a TM-unavailable interrupt will occur and the handler will reload
      those SPRs and re-enable HTM.
      
      If userspace has started a transaction and suspended it, we would
      currently lose the transactional state in the guest entry path and
      would almost certainly get a "TM Bad Thing" interrupt, which would
      cause the host to crash.  To avoid this, we detect this case and
      return from the KVM_RUN ioctl with an EINVAL error, with the KVM
      exit reason set to KVM_EXIT_FAIL_ENTRY.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1433c019
    • Feras Daoud's avatar
      IB/ipoib: Fix memory leak in create child syscall · ae5212ab
      Feras Daoud authored
      commit 4542d66b upstream.
      
      The flow of creating a new child goes through ipoib_vlan_add
      which allocates a new interface and checks the rtnl_lock.
      
      If the lock is taken, restart_syscall will be called to restart
      the system call again. In this case we are not releasing the
      already allocated interface, causing a leak.
      
      Fixes: 9baa0b03 ("IB/ipoib: Add rtnl_link_ops support")
      Signed-off-by: default avatarFeras Daoud <ferasda@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      ae5212ab