1. 06 May, 2015 40 commits
    • Huacai Chen's avatar
      MIPS: Hibernate: flush TLB entries earlier · 6fbe5c7c
      Huacai Chen authored
      commit a843d00d upstream.
      
      We found that TLB mismatch not only happens after kernel resume, but
      also happens during snapshot restore. So move it to the beginning of
      swsusp_arch_suspend().
      Signed-off-by: default avatarHuacai Chen <chenhc@lemote.com>
      Cc: Steven J. Hill <Steven.Hill@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: Fuxin Zhang <zhangfx@lemote.com>
      Cc: Zhangjin Wu <wuzhangjin@gmail.com>
      Patchwork: https://patchwork.linux-mips.org/patch/9621/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6fbe5c7c
    • Huacai Chen's avatar
      MIPS: Loongson-3: Add IRQF_NO_SUSPEND to Cascade irqaction · 9da87051
      Huacai Chen authored
      commit 0add9c2f upstream.
      
      HPET irq is routed to i8259 and then to MIPS CPU irq (cascade). After
      commit a3e6c1ef (MIPS: IRQ: Fix disable_irq on CPU IRQs), if without
      IRQF_NO_SUSPEND in cascade_irqaction, HPET interrupts will lost during
      suspend. The result is machine cannot be waken up.
      Signed-off-by: default avatarHuacai Chen <chenhc@lemote.com>
      Cc: Steven J. Hill <Steven.Hill@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: Fuxin Zhang <zhangfx@lemote.com>
      Cc: Zhangjin Wu <wuzhangjin@gmail.com>
      Patchwork: https://patchwork.linux-mips.org/patch/9528/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9da87051
    • Markos Chandras's avatar
      MIPS: unaligned: Fix regular load/store instruction emulation for EVA · e239cb24
      Markos Chandras authored
      commit 6eae3548 upstream.
      
      When emulating a regular lh/lw/lhu/sh/sw we need to use the appropriate
      instruction if we are in EVA mode. This is necessary for userspace
      applications which trigger alignment exceptions. In such case, the
      userspace load/store instruction needs to be emulated with the correct
      eva/non-eva instruction by the kernel emulator.
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Fixes: c1771216 ("MIPS: kernel: unaligned: Handle unaligned accesses for EVA")
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9503/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e239cb24
    • Markos Chandras's avatar
      MIPS: unaligned: Surround load/store macros in do {} while statements · ae0a145c
      Markos Chandras authored
      commit 3563c32d upstream.
      
      It's best to surround such complex macros with do {} while statements
      so they can appear as independent logical blocks when used within other
      control blocks.
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9502/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae0a145c
    • Markos Chandras's avatar
      MIPS: unaligned: Prevent EVA instructions on kernel unaligned accesses · 88a82d60
      Markos Chandras authored
      commit eeb53895 upstream.
      
      Commit c1771216 ("MIPS: kernel: unaligned: Handle unaligned
      accesses for EVA") allowed unaligned accesses to be emulated for
      EVA. However, when emulating regular load/store unaligned accesses,
      we need to use the appropriate "address space" instructions for that.
      Previously, an unaligned load/store instruction in kernel space would
      have used the corresponding EVA instructions to emulate it which led to
      segmentation faults because of the address translation that happens
      with EVA instructions. This is now fixed by using the EVA instruction
      only when emulating EVA unaligned accesses.
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Fixes: c1771216 ("MIPS: kernel: unaligned: Handle unaligned accesses for EVA")
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9501/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88a82d60
    • Markos Chandras's avatar
      MIPS: asm: asm-eva: Introduce kernel load/store variants · e52a20fc
      Markos Chandras authored
      commit 60cd7e08 upstream.
      
      Introduce new macros for kernel load/store variants which will be
      used to perform regular kernel space load/store operations in EVA
      mode.
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9500/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e52a20fc
    • Markos Chandras's avatar
      MIPS: Malta: Detect and fix bad memsize values · 0668432d
      Markos Chandras authored
      commit f7f8aea4 upstream.
      
      memsize denotes the amount of RAM we can access from kseg{0,1} and
      that should be up to 256M. In case the bootloader reports a value
      higher than that (perhaps reporting all the available RAM) it's best
      if we fix it ourselves and just warn the user about that. This is
      usually a problem with the bootloader and/or its environment.
      
      [ralf@linux-mips.org: Remove useless parens as suggested bei Sergei.
      Reformat long pr_warn statement to fit into 80 column limit.]
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9362/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0668432d
    • James Hogan's avatar
      MIPS: lose_fpu(): Disable FPU when MSA enabled · facbd0f2
      James Hogan authored
      commit acaf6a97 upstream.
      
      The lose_fpu() function only disables the FPU in CP0_Status.CU1 if the
      FPU is in use and MSA isn't enabled.
      
      This isn't necessarily a problem because KSTK_STATUS(current), the
      version of CP0_Status stored on the kernel stack on entry from user
      mode, does always get updated and gets restored when returning to user
      mode, but I don't think it was intended, and it is inconsistent with the
      case of only the FPU being in use. Sometimes leaving the FPU enabled may
      also mask kernel bugs where FPU operations are executed when the FPU
      might not be enabled.
      
      So lets disable the FPU in the MSA case too.
      
      Fixes: 33c771ba ("MIPS: save/disable MSA in lose_fpu")
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9323/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      facbd0f2
    • James Hogan's avatar
      MIPS: KVM: Handle MSA Disabled exceptions from guest · 7e5ed3d7
      James Hogan authored
      commit 98119ad5 upstream.
      
      Guest user mode can generate a guest MSA Disabled exception on an MSA
      capable core by simply trying to execute an MSA instruction. Since this
      exception is unknown to KVM it will be passed on to the guest kernel.
      However guest Linux kernels prior to v3.15 do not set up an exception
      handler for the MSA Disabled exception as they don't support any MSA
      capable cores. This results in a guest OS panic.
      
      Since an older processor ID may be being emulated, and MSA support is
      not advertised to the guest, the correct behaviour is to generate a
      Reserved Instruction exception in the guest kernel so it can send the
      guest process an illegal instruction signal (SIGILL), as would happen
      with a non-MSA-capable core.
      
      Fix this as minimally as reasonably possible by preventing
      kvm_mips_check_privilege() from relaying MSA Disabled exceptions from
      guest user mode to the guest kernel, and handling the MSA Disabled
      exception by emulating a Reserved Instruction exception in the guest,
      via a new handle_msa_disabled() KVM callback.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e5ed3d7
    • Ben Serebrin's avatar
      KVM: VMX: Preserve host CR4.MCE value while in guest mode. · 9656af0b
      Ben Serebrin authored
      commit 085e68ee upstream.
      
      The host's decision to enable machine check exceptions should remain
      in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
      and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
      
      Tested: Built.
      On earlier version, tested by injecting machine check
      while a guest is spinning.
      
      Before the change, if guest CR4.MCE==0, then the machine check is
      escalated to Catastrophic Error (CATERR) and the machine dies.
      If guest CR4.MCE==1, then the machine check causes VMEXIT and is
      handled normally by host Linux. After the change, injecting a machine
      check causes normal Linux machine check handling.
      Signed-off-by: default avatarBen Serebrin <serebrin@google.com>
      Reviewed-by: default avatarVenkatesh Srinivas <venkateshs@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9656af0b
    • Andre Przywara's avatar
      KVM: arm/arm64: check IRQ number on userland injection · fb124f8c
      Andre Przywara authored
      commit fd1d0ddf upstream.
      
      When userland injects a SPI via the KVM_IRQ_LINE ioctl we currently
      only check it against a fixed limit, which historically is set
      to 127. With the new dynamic IRQ allocation the effective limit may
      actually be smaller (64).
      So when now a malicious or buggy userland injects a SPI in that
      range, we spill over on our VGIC bitmaps and bytemaps memory.
      I could trigger a host kernel NULL pointer dereference with current
      mainline by injecting some bogus IRQ number from a hacked kvmtool:
      -----------------
      ....
      DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1)
      DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1)
      DEBUG: IRQ #114 still in the game, writing to bytemap now...
      Unable to handle kernel NULL pointer dereference at virtual address 00000000
      pgd = ffffffc07652e000
      [00000000] *pgd=00000000f658b003, *pud=00000000f658b003, *pmd=0000000000000000
      Internal error: Oops: 96000006 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027
      Hardware name: FVP Base (DT)
      task: ffffffc0774e9680 ti: ffffffc0765a8000 task.ti: ffffffc0765a8000
      PC is at kvm_vgic_inject_irq+0x234/0x310
      LR is at kvm_vgic_inject_irq+0x30c/0x310
      pc : [<ffffffc0000ae0a8>] lr : [<ffffffc0000ae180>] pstate: 80000145
      .....
      
      So this patch fixes this by checking the SPI number against the
      actual limit. Also we remove the former legacy hard limit of
      127 in the ioctl code.
      Signed-off-by: default avatarAndre Przywara <andre.przywara@arm.com>
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      [maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__,
      as suggested by Christopher Covington]
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb124f8c
    • Radim Krčmář's avatar
      KVM: use slowpath for cross page cached accesses · 15254fde
      Radim Krčmář authored
      commit ca3f0874 upstream.
      
      kvm_write_guest_cached() does not mark all written pages as dirty and
      code comments in kvm_gfn_to_hva_cache_init() talk about NULL memslot
      with cross page accesses.  Fix all the easy way.
      
      The check is '<= 1' to have the same result for 'len = 0' cache anywhere
      in the page.  (nr_pages_needed is 0 on page boundary.)
      
      Fixes: 8f964525 ("KVM: Allow cross page reads and writes from cached translations.")
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <20150408121648.GA3519@potion.brq.redhat.com>
      Reviewed-by: default avatarWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      15254fde
    • Heiko Carstens's avatar
      s390/hibernate: fix save and restore of kernel text section · 654de1f9
      Heiko Carstens authored
      commit d7441949 upstream.
      
      Sebastian reported a crash caused by a jump label mismatch after resume.
      This happens because we do not save the kernel text section during suspend
      and therefore also do not restore it during resume, but use the kernel image
      that restores the old system.
      
      This means that after a suspend/resume cycle we lost all modifications done
      to the kernel text section.
      The reason for this is the pfn_is_nosave() function, which incorrectly
      returns that read-only pages don't need to be saved. This is incorrect since
      we mark the kernel text section read-only.
      We still need to make sure to not save and restore pages contained within
      NSS and DCSS segment.
      To fix this add an extra case for the kernel text section and only save
      those pages if they are not contained within an NSS segment.
      
      Fixes the following crash (and the above bugs as well):
      
      Jump label code mismatch at netif_receive_skb_internal+0x28/0xd0
      Found:    c0 04 00 00 00 00
      Expected: c0 f4 00 00 00 11
      New:      c0 04 00 00 00 00
      Kernel panic - not syncing: Corrupted kernel text
      CPU: 0 PID: 9 Comm: migration/0 Not tainted 3.19.0-01975-gb1b096e70f23 #4
      Call Trace:
        [<0000000000113972>] show_stack+0x72/0xf0
        [<000000000081f15e>] dump_stack+0x6e/0x90
        [<000000000081c4e8>] panic+0x108/0x2b0
        [<000000000081be64>] jump_label_bug.isra.2+0x104/0x108
        [<0000000000112176>] __jump_label_transform+0x9e/0xd0
        [<00000000001121e6>] __sm_arch_jump_label_transform+0x3e/0x50
        [<00000000001d1136>] multi_cpu_stop+0x12e/0x170
        [<00000000001d1472>] cpu_stopper_thread+0xb2/0x168
        [<000000000015d2ac>] smpboot_thread_fn+0x134/0x1b0
        [<0000000000158baa>] kthread+0x10a/0x110
        [<0000000000824a86>] kernel_thread_starter+0x6/0xc
      Reported-and-tested-by: default avatarSebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      654de1f9
    • Jens Freimann's avatar
      KVM: s390: fix get_all_floating_irqs · 4756129f
      Jens Freimann authored
      commit 94aa033e upstream.
      
      This fixes a bug introduced with commit c05c4186 ("KVM: s390:
      add floating irq controller").
      
      get_all_floating_irqs() does copy_to_user() while holding
      a spin lock. Let's fix this by filling a temporary buffer
      first and copy it to userspace after giving up the lock.
      Reviewed-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarJens Freimann <jfrei@linux.vnet.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4756129f
    • Ekaterina Tumanova's avatar
      KVM: s390: Zero out current VMDB of STSI before including level3 data. · 7f1a4ebe
      Ekaterina Tumanova authored
      commit b75f4c9a upstream.
      
      s390 documentation requires words 0 and 10-15 to be reserved and stored as
      zeros. As we fill out all other fields, we can memset the full structure.
      Signed-off-by: default avatarEkaterina Tumanova <tumanova@linux.vnet.ibm.com>
      Reviewed-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f1a4ebe
    • David Hildenbrand's avatar
      KVM: s390: reinjection of irqs can fail in the tpi handler · 98529eff
      David Hildenbrand authored
      commit 15462e37 upstream.
      
      The reinjection of an I/O interrupt can fail if the list is at the limit
      and between the dequeue and the reinjection, another I/O interrupt is
      injected (e.g. if user space floods kvm with I/O interrupts).
      
      This patch avoids this memory leak and returns -EFAULT in this special
      case. This error is not recoverable, so let's fail hard. This can later
      be avoided by not dequeuing the interrupt but working directly on the
      locked list.
      Signed-off-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      98529eff
    • David Hildenbrand's avatar
      KVM: s390: fix handling of write errors in the tpi handler · bcdd54ff
      David Hildenbrand authored
      commit 261520dc upstream.
      
      If the I/O interrupt could not be written to the guest provided
      area (e.g. access exception), a program exception was injected into the
      guest but "inti" wasn't freed, therefore resulting in a memory leak.
      
      In addition, the I/O interrupt wasn't reinjected. Therefore the dequeued
      interrupt is lost.
      
      This patch fixes the problem while cleaning up the function and making the
      cc and rc logic easier to handle.
      Signed-off-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bcdd54ff
    • Andrzej Pietrasiewicz's avatar
      usb: gadget: printer: enqueue printer's response for setup request · 9297ed24
      Andrzej Pietrasiewicz authored
      commit eb132ccb upstream.
      
      Function-specific setup requests should be handled in such a way, that
      apart from filling in the data buffer, the requests are also actually
      enqueued: if function-specific setup is called from composte_setup(),
      the "usb_ep_queue()" block of code in composite_setup() is skipped.
      
      The printer function lacks this part and it results in e.g. get device id
      requests failing: the host expects some response, the device prepares it
      but does not equeue it for sending to the host, so the host finally asserts
      timeout.
      
      This patch adds enqueueing the prepared responses.
      
      Fixes: 2e87edf4: "usb: gadget: make g_printer use composite"
      Signed-off-by: default avatarAndrzej Pietrasiewicz <andrzej.p@samsung.com>
      Signed-off-by: default avatarFelipe Balbi <balbi@ti.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9297ed24
    • Scott Wood's avatar
      powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range() · 5cb46afa
      Scott Wood authored
      commit 50c6a665 upstream.
      
      Commit dc6c9a35 ("mm: account pmd page tables to the process")
      added a counter that is incremented whenever a PMD is allocated and
      decremented whenever a PMD is freed.  For hugepages on PPC, common code
      is used to allocated PMDs, but arch-specific code is used to free PMDs.
      
      This results in kernel output such as "BUG: non-zero nr_pmds on freeing
      mm: 1" when using hugepages.
      
      Update the PPC hugepage PMD freeing code to decrement the count, just
      as the above commit did for free_pmd_range().
      
      Fixes: dc6c9a35 ("mm: account pmd page tables to the process")
      Signed-off-by: default avatarScott Wood <scottwood@freescale.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5cb46afa
    • Gerald Schaefer's avatar
      mm/hugetlb: use pmd_page() in follow_huge_pmd() · 5683056e
      Gerald Schaefer authored
      commit 97534127 upstream.
      
      Commit 61f77eda ("mm/hugetlb: reduce arch dependent code around
      follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte
      layout differ and using pte_page() on a huge pmd will return wrong
      results.  Using pmd_page() instead fixes this.
      
      All architectures that were touched by that commit have pmd_page()
      defined, so this should not break anything on other architectures.
      
      Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*"
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5683056e
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after extent_same ioctl · 68ea2629
      Filipe Manana authored
      commit 113e8283 upstream.
      
      If we pass a length of 0 to the extent_same ioctl, we end up locking an
      extent range with a start offset greater then its end offset (if the
      destination file's offset is greater than zero). This results in a warning
      from extent_io.c:insert_state through the following call chain:
      
        btrfs_extent_same()
          btrfs_double_lock()
            lock_extent_range()
              lock_extent(inode->io_tree, offset, offset + len - 1)
                lock_extent_bits()
                  __set_extent_bit()
                    insert_state()
                      --> WARN_ON(end < start)
      
      This leads to an infinite loop when evicting the inode. This is the same
      problem that my previous patch titled
      "Btrfs: fix inode eviction infinite loop after cloning into it" addressed
      but for the extent_same ioctl instead of the clone ioctl.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68ea2629
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after cloning into it · 9301d506
      Filipe Manana authored
      commit ccccf3d6 upstream.
      
      If we attempt to clone a 0 length region into a file we can end up
      inserting a range in the inode's extent_io tree with a start offset
      that is greater then the end offset, which triggers immediately the
      following warning:
      
      [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3914.620886] BTRFS: end < start 4095 4096
      (...)
      [ 3914.638093] Call Trace:
      [ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
      [ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
      [ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
      [ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
      [ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
      [ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
      [ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
      (...)
      [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---
      
      This later makes the inode eviction handler enter an infinite loop that
      keeps dumping the following warning over and over:
      
      [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3915.119913] BTRFS: end < start 4095 4096
      (...)
      [ 3915.137394] Call Trace:
      [ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
      [ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
      [ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
      [ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
      [ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
      [ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
      [ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
      [ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
      [ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
      [ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      (...)
      [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---
      
      So just bail out of the clone ioctl if the length of the region to clone
      is zero, without locking any extent range, in order to prevent this issue
      (same behaviour as a pwrite with a 0 length for example).
      
      This is trivial to reproduce. For example, the steps for the test I just
      made for fstests:
      
        mkfs.btrfs -f SCRATCH_DEV
        mount SCRATCH_DEV $SCRATCH_MNT
      
        touch $SCRATCH_MNT/foo
        touch $SCRATCH_MNT/bar
      
        $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
        umount $SCRATCH_MNT
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9301d506
    • David Sterba's avatar
      btrfs: don't accept bare namespace as a valid xattr · 1f6719c2
      David Sterba authored
      commit 3c3b04d1 upstream.
      
      Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
      works:
      
       $ touch file
       $ setfattr -n user. -v 1 file
       $ getfattr -d file
      user.="1"
      
      ie. the missing attribute name after the namespace.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291Reported-by: default avatarWilliam Douglas <william.douglas@intel.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f6719c2
    • Filipe Manana's avatar
      Btrfs: fix log tree corruption when fs mounted with -o discard · 7362dcdb
      Filipe Manana authored
      commit dcc82f47 upstream.
      
      While committing a transaction we free the log roots before we write the
      new super block. Freeing the log roots implies marking the disk location
      of every node/leaf (metadata extent) as pinned before the new super block
      is written. This is to prevent the disk location of log metadata extents
      from being reused before the new super block is written, otherwise we
      would have a corrupted log tree if before the new super block is written
      a crash/reboot happens and the location of any log tree metadata extent
      ended up being reused and rewritten.
      
      Even though we pinned the log tree's metadata extents, we were issuing a
      discard against them if the fs was mounted with the -o discard option,
      resulting in corruption of the log tree if a crash/reboot happened before
      writing the new super block - the next time the fs was mounted, during
      the log replay process we would find nodes/leafs of the log btree with
      a content full of zeroes, causing the process to fail and require the
      use of the tool btrfs-zero-log to wipeout the log tree (and all data
      previously fsynced becoming lost forever).
      
      Fix this by not doing a discard when pinning an extent. The discard will
      be done later when it's safe (after the new super block is committed) at
      extent-tree.c:btrfs_finish_extent_commit().
      
      Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7362dcdb
    • Nadav Amit's avatar
      KVM: x86: Fix MSR_IA32_BNDCFGS in msrs_to_save · 47b34f85
      Nadav Amit authored
      commit 9e9c3fe4 upstream.
      
      kvm_init_msr_list is currently called before hardware_setup. As a result,
      vmx_mpx_supported always returns false when kvm_init_msr_list checks whether to
      save MSR_IA32_BNDCFGS.
      
      Move kvm_init_msr_list after vmx_hardware_setup is called to fix this issue.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Message-Id: <1428864435-4732-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      47b34f85
    • Linus Torvalds's avatar
      x86: fix special __probe_kernel_write() tail zeroing case · 5c966c4f
      Linus Torvalds authored
      commit d869844b upstream.
      
      Commit cae2a173 ("x86: clean up/fix 'copy_in_user()' tail zeroing")
      fixed the failure case tail zeroing of one special case of the x86-64
      generic user-copy routine, namely when used for the user-to-user case
      ("copy_in_user()").
      
      But in the process it broke an even more unusual case: using the user
      copy routine for kernel-to-kernel copying.
      
      Now, normally kernel-kernel copies are obviously done using memcpy(),
      but we have a couple of special cases when we use the user-copy
      functions.  One is when we pass a kernel buffer to a regular user-buffer
      routine, using set_fs(KERNEL_DS).  That's a "normal" case, and continued
      to work fine, because it never takes any faults (with the possible
      exception of a silent and successful vmalloc fault).
      
      But Jan Beulich pointed out another, very unusual, special case: when we
      use the user-copy routines not because it's a path that expects a user
      pointer, but for a couple of ftrace/kgdb cases that want to do a kernel
      copy, but do so using "unsafe" buffers, and use the user-copy routine to
      gracefully handle faults.  IOW, for probe_kernel_write().
      
      And that broke for the case of a faulting kernel destination, because we
      saw the kernel destination and wanted to try to clear the tail of the
      buffer.  Which doesn't work, since that's what faults.
      
      This only triggers for things like kgdb and ftrace users (eg trying
      setting a breakpoint on read-only memory), but it's definitely a bug.
      The fix is to not compare against the kernel address start (TASK_SIZE),
      but instead use the same limits "access_ok()" uses.
      Reported-and-tested-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c966c4f
    • Peter Zijlstra's avatar
      perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp events · 6e4dd840
      Peter Zijlstra authored
      commit 517e6341 upstream.
      
      Ingo reported that cycles:pp didn't work for him on some machines.
      
      It turns out that in this commit:
      
        af4bdcf6 perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events
      
      Andi forgot to explicitly allow that event when he
      disabled event flags for PEBS on those uarchs.
      Reported-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: af4bdcf6 ("perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e4dd840
    • Mike Galbraith's avatar
      sched/idle/x86: Optimize unnecessary mwait_idle() resched IPIs · aaa51337
      Mike Galbraith authored
      commit f8e617f4 upstream.
      
      To fully take advantage of MWAIT, apparently the CLFLUSH instruction needs
      another quirk on certain CPUs: proper barriers around it on certain machines.
      
      On a Q6600 SMP system, pipe-test scheduling performance, cross core,
      improves significantly:
      
        3.8.13                   487.2 KHz    1.000
        3.13.0-master            415.5 KHz     .852
        3.13.0-master+           415.2 KHz     .852     + restore mwait_idle
        3.13.0-master++          488.5 KHz    1.002     + restore mwait_idle + IPI fix
      
      Since X86_BUG_CLFLUSH_MONITOR is already a quirk, don't create a separate
      quirk for the extra smp_mb()s.
      Signed-off-by: default avatarMike Galbraith <bitbucket@online.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ian Malone <ibmalone@gmail.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1390061684.5566.4.camel@marge.simpson.net
      [ Ported to recent kernel, added comments about the quirk. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aaa51337
    • Len Brown's avatar
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power... · 0e625b6d
      Len Brown authored
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power savings and to improve performance
      
      commit b253149b upstream.
      
      In Linux-3.9 we removed the mwait_idle() loop:
      
        69fb3676 ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")
      
      The reasoning was that modern machines should be sufficiently
      happy during the boot process using the default_idle() HALT
      loop, until cpuidle loads and either acpi_idle or intel_idle
      invoke the newer MWAIT-with-hints idle loop.
      
      But two machines reported problems:
      
       1. Certain Core2-era machines support MWAIT-C1 and HALT only.
          MWAIT-C1 is preferred for optimal power and performance.
          But if they support just C1, cpuidle never loads and
          so they use the boot-time default idle loop forever.
      
       2. Some laptops will boot-hang if HALT is used,
          but will boot successfully if MWAIT is used.
          This appears to be a hidden assumption in BIOS SMI,
          that is presumably valid on the proprietary OS
          where the BIOS was validated.
      
             https://bugzilla.kernel.org/show_bug.cgi?id=60770
      
      So here we effectively revert the patch above, restoring
      the mwait_idle() loop.  However, we don't bother restoring
      the idle=mwait cmdline parameter, since it appears to add
      no value.
      
      Maintainer notes:
      
        For 3.9, simply revert 69fb3676
        for 3.10, patch -F3 applies, fuzz needed due to __cpuinit use in
        context For 3.11, 3.12, 3.13, this patch applies cleanly
      Tested-by: default avatarMike Galbraith <bitbucket@online.de>
      Signed-off-by: default avatarLen Brown <len.brown@intel.com>
      Acked-by: default avatarMike Galbraith <bitbucket@online.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ian Malone <ibmalone@gmail.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com
      [ Ported to recent kernels. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      0e625b6d
    • Radim Krčmář's avatar
      x86: vdso: fix pvclock races with task migration · 82a7e673
      Radim Krčmář authored
      commit 80f7fdb1 upstream.
      
      If we were migrated right after __getcpu, but before reading the
      migration_count, we wouldn't notice that we read TSC of a different
      VCPU, nor that KVM's bug made pvti invalid, as only migration_count
      on source VCPU is increased.
      
      Change vdso instead of updating migration_count on destination.
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Fixes: 0a4e6be9 ("x86: kvm: Revert "remove sched notifier for cross-cpu migrations"")
      Message-Id: <1428000263-11892-1-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      82a7e673
    • Marcelo Tosatti's avatar
      x86: kvm: Revert "remove sched notifier for cross-cpu migrations" · 3fbb83fd
      Marcelo Tosatti authored
      commit 0a4e6be9 upstream.
      
      The following point:
      
          2. per-CPU pvclock time info is updated if the
             underlying CPU changes.
      
      Is not true anymore since "KVM: x86: update pvclock area conditionally,
      on cpu migration".
      
      Add task migration notification back.
      
      Problem noticed by Andy Lutomirski.
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3fbb83fd
    • Andy Lutomirski's avatar
      x86/asm/decoder: Fix and enforce max instruction size in the insn decoder · bbe33d79
      Andy Lutomirski authored
      commit 91e5ed49 upstream.
      
      x86 instructions cannot exceed 15 bytes, and the instruction
      decoder should enforce that.  Prior to 6ba48ff4, the
      instruction length limit was implicitly set to 16, which was an
      approximation of 15, but there is currently no limit at all.
      
      Fix MAX_INSN_SIZE (it should be 15, not 16), and fix the decoder
      to reject instructions that exceed MAX_INSN_SIZE.
      
      Other than potentially confusing some of the decoder sanity
      checks, I'm not aware of any actual problems that omitting this
      check would cause, nor am I aware of any practical problems
      caused by the MAX_INSN_SIZE error.
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Acked-by: default avatarMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Fixes: 6ba48ff4 ("x86: Remove arbitrary instruction size limit ...
      Link: http://lkml.kernel.org/r/f8f0bc9b8c58cfd6830f7d88400bf1396cbdcd0f.1422403511.git.luto@amacapital.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bbe33d79
    • Gu Zheng's avatar
      md: fix md io stats accounting broken · 8336ee90
      Gu Zheng authored
      commit 74672d06 upstream.
      
      Simon reported the md io stats accounting issue:
      "
      I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
      It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
      the other two being write-mostly:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
      md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
      md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
      "
      The cause is commit "18c0b223" uses the
      generic_start_io_acct to account the disk stats rather than the open code,
      but it also introduced the increase to .in_flight[rw] which is needless to
      md. So we re-use the open code here to fix it.
      Reported-by: default avatarSimon Kirby <sim@hostway.ca>
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8336ee90
    • Amir Vadai's avatar
      net/mlx4_en: Prevent setting invalid RSS hash function · 36fb8ea9
      Amir Vadai authored
      [ Upstream commit b3706909 ]
      
      mlx4_en_check_rxfh_func() was checking for hardware support before
      setting a known RSS hash function, but didn't do any check before
      setting unknown RSS hash function. Need to make it fail on such values.
      In this occasion, moved the actual setting of the new value from the
      check function into mlx4_en_set_rxfh().
      
      Fixes: 947cbb0a ("net/mlx4_en: Support for configurable RSS hash function")
      Signed-off-by: default avatarAmir Vadai <amirv@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      36fb8ea9
    • Eric Dumazet's avatar
      net: rfs: fix crash in get_rps_cpus() · b32dec8a
      Eric Dumazet authored
      [ Upstream commit a31196b0 ]
      
      Commit 567e4b79 ("net: rfs: add hash collision detection") had one
      mistake :
      
      RPS_NO_CPU is no longer the marker for invalid cpu in set_rps_cpu()
      and get_rps_cpu(), as @next_cpu was the result of an AND with
      rps_cpu_mask
      
      This bug showed up on a host with 72 cpus :
      next_cpu was 0x7f, and the code was trying to access percpu data of an
      non existent cpu.
      
      In a follow up patch, we might get rid of compares against nr_cpu_ids,
      if we init the tables with 0. This is silly to test for a very unlikely
      condition that exists only shortly after table initialization, as
      we got rid of rps_reset_sock_flow() and similar functions that were
      writing this RPS_NO_CPU magic value at flow dismantle : When table is
      old enough, it never contains this value anymore.
      
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b32dec8a
    • Alexey Khoroshilov's avatar
      pxa168: fix double deallocation of managed resources · f80e3eb9
      Alexey Khoroshilov authored
      [ Upstream commit 0e03fd3e ]
      
      Commit 43d3ddf8 ("net: pxa168_eth: add device tree support") starts
      to use managed resources by adding devm_clk_get() and
      devm_ioremap_resource(), but it leaves explicit iounmap() and clock_put()
      in pxa168_eth_remove() and in failure handling code of pxa168_eth_probe().
      As a result double free can happen.
      
      The patch removes explicit resource deallocation. Also it converts
      clk_disable() to clk_disable_unprepare() to make it symmetrical with
      clk_prepare_enable().
      
      Found by Linux Driver Verification project (linuxtesting.org).
      Signed-off-by: default avatarAlexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f80e3eb9
    • Eric Dumazet's avatar
      net: fix crash in build_skb() · f009181d
      Eric Dumazet authored
      [ Upstream commit 2ea2f62c ]
      
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f009181d
    • Eric Dumazet's avatar
      net: do not deplete pfmemalloc reserve · e591662c
      Eric Dumazet authored
      [ Upstream commit 79930f58 ]
      
      build_skb() should look at the page pfmemalloc status.
      If set, this means page allocator allocated this page in the
      expectation it would help to free other pages. Networking
      stack can do that only if skb->pfmemalloc is also set.
      
      Also, we must refrain using high order pages from the pfmemalloc
      reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
      them. Under memory pressure, using order-0 pages is probably the best
      strategy.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e591662c
    • Eric Dumazet's avatar
      tcp: avoid looping in tcp_send_fin() · 7e724697
      Eric Dumazet authored
      [ Upstream commit 845704a5 ]
      
      Presence of an unbound loop in tcp_send_fin() had always been hard
      to explain when analyzing crash dumps involving gigantic dying processes
      with millions of sockets.
      
      Lets try a different strategy :
      
      In case of memory pressure, try to add the FIN flag to last packet
      in write queue, even if packet was already sent. TCP stack will
      be able to deliver this FIN after a timeout event. Note that this
      FIN being delivered by a retransmit, it also carries a Push flag
      given our current implementation.
      
      By checking sk_under_memory_pressure(), we anticipate that cooking
      many FIN packets might deplete tcp memory.
      
      In the case we could not allocate a packet, even with __GFP_WAIT
      allocation, then not sending a FIN seems quite reasonable if it allows
      to get rid of this socket, free memory, and not block the process from
      eventually doing other useful work.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e724697
    • Eric Dumazet's avatar
      tcp: fix possible deadlock in tcp_send_fin() · e1b095eb
      Eric Dumazet authored
      [ Upstream commit d83769a5 ]
      
      Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
      case a huge process is killed by OOM, and tcp_mem[2] is hit.
      
      To be able to free memory we need to make progress, so this
      patch allows FIN packets to not care about tcp_mem[2], if
      skb allocation succeeded.
      
      In a follow-up patch, we might abort tcp_send_fin() infinite loop
      in case TIF_MEMDIE is set on this thread, as memory allocator
      did its best getting extra memory already.
      
      This patch reverts d22e1537 ("tcp: fix tcp fin memory accounting")
      
      Fixes: d22e1537 ("tcp: fix tcp fin memory accounting")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e1b095eb