1. 17 May, 2015 24 commits
    • Alistair Strachan's avatar
      staging: android: sync: Fix memory corruption in sync_timeline_signal(). · f364a04f
      Alistair Strachan authored
      [ Upstream commit 8e43c9c7 ]
      
      The android_fence_release() function checks for active sync points
      by calling list_empty() on the list head embedded on the sync
      point. However, it is only valid to use list_empty() on nodes that
      have been initialized with INIT_LIST_HEAD() or list_del_init().
      
      Because the list entry has likely been removed from the active list
      by sync_timeline_signal(), there is a good chance that this
      WARN_ON_ONCE() will be hit due to dangling pointers pointing at
      freed memory (even though the sync drivers did nothing wrong)
      and memory corruption will ensue as the list entry is removed for
      a second time, corrupting the active list.
      
      This problem can be reproduced quite easily with CONFIG_DEBUG_LIST=y
      and fences with more than one sync point.
      Signed-off-by: default avatarAlistair Strachan <alistair.strachan@imgtec.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Colin Cross <ccross@google.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      f364a04f
    • Sudip Mukherjee's avatar
      staging: panel: fix lcd type · fd431a7c
      Sudip Mukherjee authored
      [ Upstream commit 2c20d92d ]
      
      the lcd type as defined in the Kconfig is not matching in the code.
      as a result the rs, rw and en pins were getting interchanged.
      Kconfig defines the value of PANEL_LCD to be 1 if we select custom
      configuration but in the code LCD_TYPE_CUSTOM is defined as 5.
      
      my hardware is LCD_TYPE_CUSTOM, but the pins were assigned to it
      as pins of LCD_TYPE_OLD, and it was not working.
      Now values are corrected with referenece to the values defined in
      Kconfig and it is working.
      checked on JHD204A lcd with LCD_TYPE_CUSTOM configuration.
      
      Cc: <stable@vger.kernel.org> # 2.6.32+
      Signed-off-by: default avatarSudip Mukherjee <sudip@vectorindia.org>
      Acked-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      fd431a7c
    • Huacai Chen's avatar
      MIPS: Hibernate: flush TLB entries earlier · 886b9d66
      Huacai Chen authored
      [ Upstream commit a843d00d ]
      
      We found that TLB mismatch not only happens after kernel resume, but
      also happens during snapshot restore. So move it to the beginning of
      swsusp_arch_suspend().
      Signed-off-by: default avatarHuacai Chen <chenhc@lemote.com>
      Cc: <stable@vger.kernel.org>
      Cc: Steven J. Hill <Steven.Hill@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: Fuxin Zhang <zhangfx@lemote.com>
      Cc: Zhangjin Wu <wuzhangjin@gmail.com>
      Cc: stable@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/9621/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      886b9d66
    • Huacai Chen's avatar
      MIPS: Loongson-3: Add IRQF_NO_SUSPEND to Cascade irqaction · 38d99bff
      Huacai Chen authored
      [ Upstream commit 0add9c2f ]
      
      HPET irq is routed to i8259 and then to MIPS CPU irq (cascade). After
      commit a3e6c1ef (MIPS: IRQ: Fix disable_irq on CPU IRQs), if without
      IRQF_NO_SUSPEND in cascade_irqaction, HPET interrupts will lost during
      suspend. The result is machine cannot be waken up.
      Signed-off-by: default avatarHuacai Chen <chenhc@lemote.com>
      Cc: <stable@vger.kernel.org>
      Cc: Steven J. Hill <Steven.Hill@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: Fuxin Zhang <zhangfx@lemote.com>
      Cc: Zhangjin Wu <wuzhangjin@gmail.com>
      Patchwork: https://patchwork.linux-mips.org/patch/9528/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      38d99bff
    • Markos Chandras's avatar
      MIPS: asm: asm-eva: Introduce kernel load/store variants · 301080ae
      Markos Chandras authored
      [ Upstream commit 60cd7e08 ]
      
      Introduce new macros for kernel load/store variants which will be
      used to perform regular kernel space load/store operations in EVA
      mode.
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Cc: <stable@vger.kernel.org> # v3.15+
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9500/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      301080ae
    • Markos Chandras's avatar
      MIPS: Malta: Detect and fix bad memsize values · 312bc67c
      Markos Chandras authored
      [ Upstream commit f7f8aea4 ]
      
      memsize denotes the amount of RAM we can access from kseg{0,1} and
      that should be up to 256M. In case the bootloader reports a value
      higher than that (perhaps reporting all the available RAM) it's best
      if we fix it ourselves and just warn the user about that. This is
      usually a problem with the bootloader and/or its environment.
      
      [ralf@linux-mips.org: Remove useless parens as suggested bei Sergei.
      Reformat long pr_warn statement to fit into 80 column limit.]
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Cc: <stable@vger.kernel.org> # v3.15+
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9362/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      312bc67c
    • James Hogan's avatar
      MIPS: lose_fpu(): Disable FPU when MSA enabled · 00f1187a
      James Hogan authored
      [ Upstream commit f8483988 ]
      
      The lose_fpu() function only disables the FPU in CP0_Status.CU1 if the
      FPU is in use and MSA isn't enabled.
      
      This isn't necessarily a problem because KSTK_STATUS(current), the
      version of CP0_Status stored on the kernel stack on entry from user
      mode, does always get updated and gets restored when returning to user
      mode, but I don't think it was intended, and it is inconsistent with the
      case of only the FPU being in use. Sometimes leaving the FPU enabled may
      also mask kernel bugs where FPU operations are executed when the FPU
      might not be enabled.
      
      So lets disable the FPU in the MSA case too.
      
      Fixes: 33c771ba ("MIPS: save/disable MSA in lose_fpu")
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/9323/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      00f1187a
    • James Hogan's avatar
      MIPS: KVM: Handle MSA Disabled exceptions from guest · 707ff225
      James Hogan authored
      [ Upstream commit 98119ad5 ]
      
      Guest user mode can generate a guest MSA Disabled exception on an MSA
      capable core by simply trying to execute an MSA instruction. Since this
      exception is unknown to KVM it will be passed on to the guest kernel.
      However guest Linux kernels prior to v3.15 do not set up an exception
      handler for the MSA Disabled exception as they don't support any MSA
      capable cores. This results in a guest OS panic.
      
      Since an older processor ID may be being emulated, and MSA support is
      not advertised to the guest, the correct behaviour is to generate a
      Reserved Instruction exception in the guest kernel so it can send the
      guest process an illegal instruction signal (SIGILL), as would happen
      with a non-MSA-capable core.
      
      Fix this as minimally as reasonably possible by preventing
      kvm_mips_check_privilege() from relaying MSA Disabled exceptions from
      guest user mode to the guest kernel, and handling the MSA Disabled
      exception by emulating a Reserved Instruction exception in the guest,
      via a new handle_msa_disabled() KVM callback.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # v3.15+
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      707ff225
    • Andre Przywara's avatar
      KVM: arm/arm64: check IRQ number on userland injection · e580744e
      Andre Przywara authored
      [ Upstream commit fd1d0ddf ]
      
      When userland injects a SPI via the KVM_IRQ_LINE ioctl we currently
      only check it against a fixed limit, which historically is set
      to 127. With the new dynamic IRQ allocation the effective limit may
      actually be smaller (64).
      So when now a malicious or buggy userland injects a SPI in that
      range, we spill over on our VGIC bitmaps and bytemaps memory.
      I could trigger a host kernel NULL pointer dereference with current
      mainline by injecting some bogus IRQ number from a hacked kvmtool:
      -----------------
      ....
      DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1)
      DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1)
      DEBUG: IRQ #114 still in the game, writing to bytemap now...
      Unable to handle kernel NULL pointer dereference at virtual address 00000000
      pgd = ffffffc07652e000
      [00000000] *pgd=00000000f658b003, *pud=00000000f658b003, *pmd=0000000000000000
      Internal error: Oops: 96000006 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027
      Hardware name: FVP Base (DT)
      task: ffffffc0774e9680 ti: ffffffc0765a8000 task.ti: ffffffc0765a8000
      PC is at kvm_vgic_inject_irq+0x234/0x310
      LR is at kvm_vgic_inject_irq+0x30c/0x310
      pc : [<ffffffc0000ae0a8>] lr : [<ffffffc0000ae180>] pstate: 80000145
      .....
      
      So this patch fixes this by checking the SPI number against the
      actual limit. Also we remove the former legacy hard limit of
      127 in the ioctl code.
      Signed-off-by: default avatarAndre Przywara <andre.przywara@arm.com>
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      CC: <stable@vger.kernel.org> # 4.0, 3.19, 3.18
      [maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__,
      as suggested by Christopher Covington]
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e580744e
    • Radim Krčmář's avatar
      KVM: use slowpath for cross page cached accesses · 35e13292
      Radim Krčmář authored
      [ Upstream commit ca3f0874 ]
      
      kvm_write_guest_cached() does not mark all written pages as dirty and
      code comments in kvm_gfn_to_hva_cache_init() talk about NULL memslot
      with cross page accesses.  Fix all the easy way.
      
      The check is '<= 1' to have the same result for 'len = 0' cache anywhere
      in the page.  (nr_pages_needed is 0 on page boundary.)
      
      Fixes: 8f964525 ("KVM: Allow cross page reads and writes from cached translations.")
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <20150408121648.GA3519@potion.brq.redhat.com>
      Reviewed-by: default avatarWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      35e13292
    • Heiko Carstens's avatar
      s390/hibernate: fix save and restore of kernel text section · 23786453
      Heiko Carstens authored
      [ Upstream commit d7441949 ]
      
      Sebastian reported a crash caused by a jump label mismatch after resume.
      This happens because we do not save the kernel text section during suspend
      and therefore also do not restore it during resume, but use the kernel image
      that restores the old system.
      
      This means that after a suspend/resume cycle we lost all modifications done
      to the kernel text section.
      The reason for this is the pfn_is_nosave() function, which incorrectly
      returns that read-only pages don't need to be saved. This is incorrect since
      we mark the kernel text section read-only.
      We still need to make sure to not save and restore pages contained within
      NSS and DCSS segment.
      To fix this add an extra case for the kernel text section and only save
      those pages if they are not contained within an NSS segment.
      
      Fixes the following crash (and the above bugs as well):
      
      Jump label code mismatch at netif_receive_skb_internal+0x28/0xd0
      Found:    c0 04 00 00 00 00
      Expected: c0 f4 00 00 00 11
      New:      c0 04 00 00 00 00
      Kernel panic - not syncing: Corrupted kernel text
      CPU: 0 PID: 9 Comm: migration/0 Not tainted 3.19.0-01975-gb1b096e70f23 #4
      Call Trace:
        [<0000000000113972>] show_stack+0x72/0xf0
        [<000000000081f15e>] dump_stack+0x6e/0x90
        [<000000000081c4e8>] panic+0x108/0x2b0
        [<000000000081be64>] jump_label_bug.isra.2+0x104/0x108
        [<0000000000112176>] __jump_label_transform+0x9e/0xd0
        [<00000000001121e6>] __sm_arch_jump_label_transform+0x3e/0x50
        [<00000000001d1136>] multi_cpu_stop+0x12e/0x170
        [<00000000001d1472>] cpu_stopper_thread+0xb2/0x168
        [<000000000015d2ac>] smpboot_thread_fn+0x134/0x1b0
        [<0000000000158baa>] kthread+0x10a/0x110
        [<0000000000824a86>] kernel_thread_starter+0x6/0xc
      Reported-and-tested-by: default avatarSebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      23786453
    • Jens Freimann's avatar
      KVM: s390: fix get_all_floating_irqs · 00d83726
      Jens Freimann authored
      [ Upstream commit 94aa033e ]
      
      This fixes a bug introduced with commit c05c4186 ("KVM: s390:
      add floating irq controller").
      
      get_all_floating_irqs() does copy_to_user() while holding
      a spin lock. Let's fix this by filling a temporary buffer
      first and copy it to userspace after giving up the lock.
      
      Cc: <stable@vger.kernel.org> # 3.18+: 69a8d456 KVM: s390: no need to hold...
      Reviewed-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarJens Freimann <jfrei@linux.vnet.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      00d83726
    • Christian Borntraeger's avatar
      KVM: s390: no need to hold the kvm->mutex for floating interrupts · 7e15bc0e
      Christian Borntraeger authored
      [ Upstream commit 69a8d456 ]
      
      The kvm mutex was (probably) used to protect against cpu hotplug.
      The current code no longer needs to protect against that, as we only
      rely on CPU data structures that are guaranteed to be available
      if we can access the CPU. (e.g. vcpu_create will put the cpu
      in the array AFTER the cpu is ready).
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Reviewed-by: default avatarJens Freimann <jfrei@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      7e15bc0e
    • Ekaterina Tumanova's avatar
      KVM: s390: Zero out current VMDB of STSI before including level3 data. · abd80ecb
      Ekaterina Tumanova authored
      [ Upstream commit b75f4c9a ]
      
      s390 documentation requires words 0 and 10-15 to be reserved and stored as
      zeros. As we fill out all other fields, we can memset the full structure.
      Signed-off-by: default avatarEkaterina Tumanova <tumanova@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      abd80ecb
    • David Hildenbrand's avatar
      KVM: s390: reinjection of irqs can fail in the tpi handler · b7c23d30
      David Hildenbrand authored
      [ Upstream commit 15462e37 ]
      
      The reinjection of an I/O interrupt can fail if the list is at the limit
      and between the dequeue and the reinjection, another I/O interrupt is
      injected (e.g. if user space floods kvm with I/O interrupts).
      
      This patch avoids this memory leak and returns -EFAULT in this special
      case. This error is not recoverable, so let's fail hard. This can later
      be avoided by not dequeuing the interrupt but working directly on the
      locked list.
      Signed-off-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      b7c23d30
    • David Hildenbrand's avatar
      KVM: s390: fix handling of write errors in the tpi handler · 19881aff
      David Hildenbrand authored
      [ Upstream commit 261520dc ]
      
      If the I/O interrupt could not be written to the guest provided
      area (e.g. access exception), a program exception was injected into the
      guest but "inti" wasn't freed, therefore resulting in a memory leak.
      
      In addition, the I/O interrupt wasn't reinjected. Therefore the dequeued
      interrupt is lost.
      
      This patch fixes the problem while cleaning up the function and making the
      cc and rc logic easier to handle.
      Signed-off-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      19881aff
    • Andrzej Pietrasiewicz's avatar
      usb: gadget: printer: enqueue printer's response for setup request · 67c5b95c
      Andrzej Pietrasiewicz authored
      [ Upstream commit eb132ccb ]
      
      Function-specific setup requests should be handled in such a way, that
      apart from filling in the data buffer, the requests are also actually
      enqueued: if function-specific setup is called from composte_setup(),
      the "usb_ep_queue()" block of code in composite_setup() is skipped.
      
      The printer function lacks this part and it results in e.g. get device id
      requests failing: the host expects some response, the device prepares it
      but does not equeue it for sending to the host, so the host finally asserts
      timeout.
      
      This patch adds enqueueing the prepared responses.
      
      Cc: <stable@vger.kernel.org> # v3.4+
      Fixes: 2e87edf4: "usb: gadget: make g_printer use composite"
      Signed-off-by: default avatarAndrzej Pietrasiewicz <andrzej.p@samsung.com>
      Signed-off-by: default avatarFelipe Balbi <balbi@ti.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      67c5b95c
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after extent_same ioctl · 43e8149d
      Filipe Manana authored
      [ Upstream commit HEAD ]
      
      commit 113e8283 upstream.
      
      If we pass a length of 0 to the extent_same ioctl, we end up locking an
      extent range with a start offset greater then its end offset (if the
      destination file's offset is greater than zero). This results in a warning
      from extent_io.c:insert_state through the following call chain:
      
        btrfs_extent_same()
          btrfs_double_lock()
            lock_extent_range()
              lock_extent(inode->io_tree, offset, offset + len - 1)
                lock_extent_bits()
                  __set_extent_bit()
                    insert_state()
                      --> WARN_ON(end < start)
      
      This leads to an infinite loop when evicting the inode. This is the same
      problem that my previous patch titled
      "Btrfs: fix inode eviction infinite loop after cloning into it" addressed
      but for the extent_same ioctl instead of the clone ioctl.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      (cherry picked from commit 9dc10661)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      43e8149d
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after cloning into it · d5454242
      Filipe Manana authored
      [ Upstream commit HEAD ]
      
      commit ccccf3d6 upstream.
      
      If we attempt to clone a 0 length region into a file we can end up
      inserting a range in the inode's extent_io tree with a start offset
      that is greater then the end offset, which triggers immediately the
      following warning:
      
      [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3914.620886] BTRFS: end < start 4095 4096
      (...)
      [ 3914.638093] Call Trace:
      [ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
      [ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
      [ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
      [ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
      [ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
      [ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
      [ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
      (...)
      [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---
      
      This later makes the inode eviction handler enter an infinite loop that
      keeps dumping the following warning over and over:
      
      [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3915.119913] BTRFS: end < start 4095 4096
      (...)
      [ 3915.137394] Call Trace:
      [ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
      [ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
      [ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
      [ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
      [ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
      [ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
      [ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
      [ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
      [ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
      [ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      (...)
      [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---
      
      So just bail out of the clone ioctl if the length of the region to clone
      is zero, without locking any extent range, in order to prevent this issue
      (same behaviour as a pwrite with a 0 length for example).
      
      This is trivial to reproduce. For example, the steps for the test I just
      made for fstests:
      
        mkfs.btrfs -f SCRATCH_DEV
        mount SCRATCH_DEV $SCRATCH_MNT
      
        touch $SCRATCH_MNT/foo
        touch $SCRATCH_MNT/bar
      
        $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
        umount $SCRATCH_MNT
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      (cherry picked from commit 449b4627)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d5454242
    • David Sterba's avatar
      btrfs: don't accept bare namespace as a valid xattr · 5728a924
      David Sterba authored
      [ Upstream commit HEAD ]
      
      commit 3c3b04d1 upstream.
      
      Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
      works:
      
       $ touch file
       $ setfattr -n user. -v 1 file
       $ getfattr -d file
      user.="1"
      
      ie. the missing attribute name after the namespace.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291Reported-by: default avatarWilliam Douglas <william.douglas@intel.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      (cherry picked from commit 1bb2835e)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5728a924
    • Filipe Manana's avatar
      Btrfs: fix log tree corruption when fs mounted with -o discard · 372b2ac5
      Filipe Manana authored
      [ Upstream commit HEAD ]
      
      commit dcc82f47 upstream.
      
      While committing a transaction we free the log roots before we write the
      new super block. Freeing the log roots implies marking the disk location
      of every node/leaf (metadata extent) as pinned before the new super block
      is written. This is to prevent the disk location of log metadata extents
      from being reused before the new super block is written, otherwise we
      would have a corrupted log tree if before the new super block is written
      a crash/reboot happens and the location of any log tree metadata extent
      ended up being reused and rewritten.
      
      Even though we pinned the log tree's metadata extents, we were issuing a
      discard against them if the fs was mounted with the -o discard option,
      resulting in corruption of the log tree if a crash/reboot happened before
      writing the new super block - the next time the fs was mounted, during
      the log replay process we would find nodes/leafs of the log btree with
      a content full of zeroes, causing the process to fail and require the
      use of the tool btrfs-zero-log to wipeout the log tree (and all data
      previously fsynced becoming lost forever).
      
      Fix this by not doing a discard when pinning an extent. The discard will
      be done later when it's safe (after the new super block is committed) at
      extent-tree.c:btrfs_finish_extent_commit().
      
      Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      (cherry picked from commit 3909e5a9)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      372b2ac5
    • Nadav Amit's avatar
      KVM: x86: Fix MSR_IA32_BNDCFGS in msrs_to_save · 753fd54a
      Nadav Amit authored
      [ Upstream commit HEAD ]
      
      commit 9e9c3fe4 upstream.
      
      kvm_init_msr_list is currently called before hardware_setup. As a result,
      vmx_mpx_supported always returns false when kvm_init_msr_list checks whether to
      save MSR_IA32_BNDCFGS.
      
      Move kvm_init_msr_list after vmx_hardware_setup is called to fix this issue.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Message-Id: <1428864435-4732-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      
      (cherry picked from commit 702a71cf)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      753fd54a
    • Mike Galbraith's avatar
      sched/idle/x86: Optimize unnecessary mwait_idle() resched IPIs · 6cbb41b1
      Mike Galbraith authored
      [ Upstream commit f8e617f4 ]
      
      To fully take advantage of MWAIT, apparently the CLFLUSH instruction needs
      another quirk on certain CPUs: proper barriers around it on certain machines.
      
      On a Q6600 SMP system, pipe-test scheduling performance, cross core,
      improves significantly:
      
        3.8.13                   487.2 KHz    1.000
        3.13.0-master            415.5 KHz     .852
        3.13.0-master+           415.2 KHz     .852     + restore mwait_idle
        3.13.0-master++          488.5 KHz    1.002     + restore mwait_idle + IPI fix
      
      Since X86_BUG_CLFLUSH_MONITOR is already a quirk, don't create a separate
      quirk for the extra smp_mb()s.
      Signed-off-by: default avatarMike Galbraith <bitbucket@online.de>
      Cc: <stable@vger.kernel.org> # 3.10+
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ian Malone <ibmalone@gmail.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1390061684.5566.4.camel@marge.simpson.net
      [ Ported to recent kernel, added comments about the quirk. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      6cbb41b1
    • Len Brown's avatar
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power... · 560e6448
      Len Brown authored
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power savings and to improve performance
      
      [ Upstream commit b253149b ]
      
      In Linux-3.9 we removed the mwait_idle() loop:
      
        69fb3676 ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")
      
      The reasoning was that modern machines should be sufficiently
      happy during the boot process using the default_idle() HALT
      loop, until cpuidle loads and either acpi_idle or intel_idle
      invoke the newer MWAIT-with-hints idle loop.
      
      But two machines reported problems:
      
       1. Certain Core2-era machines support MWAIT-C1 and HALT only.
          MWAIT-C1 is preferred for optimal power and performance.
          But if they support just C1, cpuidle never loads and
          so they use the boot-time default idle loop forever.
      
       2. Some laptops will boot-hang if HALT is used,
          but will boot successfully if MWAIT is used.
          This appears to be a hidden assumption in BIOS SMI,
          that is presumably valid on the proprietary OS
          where the BIOS was validated.
      
             https://bugzilla.kernel.org/show_bug.cgi?id=60770
      
      So here we effectively revert the patch above, restoring
      the mwait_idle() loop.  However, we don't bother restoring
      the idle=mwait cmdline parameter, since it appears to add
      no value.
      
      Maintainer notes:
      
        For 3.9, simply revert 69fb3676
        for 3.10, patch -F3 applies, fuzz needed due to __cpuinit use in
        context For 3.11, 3.12, 3.13, this patch applies cleanly
      Tested-by: default avatarMike Galbraith <bitbucket@online.de>
      Signed-off-by: default avatarLen Brown <len.brown@intel.com>
      Acked-by: default avatarMike Galbraith <bitbucket@online.de>
      Cc: <stable@vger.kernel.org> # 3.9+
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ian Malone <ibmalone@gmail.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com
      [ Ported to recent kernels. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      560e6448
  2. 11 May, 2015 16 commits
    • Eric Dumazet's avatar
      net: fix crash in build_skb() · 0ff99ba9
      Eric Dumazet authored
      [ Upstream commit 2ea2f62c ]
      
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      0ff99ba9
    • Eric Dumazet's avatar
      net: do not deplete pfmemalloc reserve · cfe7befc
      Eric Dumazet authored
      [ Upstream commit 79930f58 ]
      
      build_skb() should look at the page pfmemalloc status.
      If set, this means page allocator allocated this page in the
      expectation it would help to free other pages. Networking
      stack can do that only if skb->pfmemalloc is also set.
      
      Also, we must refrain using high order pages from the pfmemalloc
      reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
      them. Under memory pressure, using order-0 pages is probably the best
      strategy.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      cfe7befc
    • Eric Dumazet's avatar
      tcp: avoid looping in tcp_send_fin() · b5635e45
      Eric Dumazet authored
      [ Upstream commit 845704a5 ]
      
      Presence of an unbound loop in tcp_send_fin() had always been hard
      to explain when analyzing crash dumps involving gigantic dying processes
      with millions of sockets.
      
      Lets try a different strategy :
      
      In case of memory pressure, try to add the FIN flag to last packet
      in write queue, even if packet was already sent. TCP stack will
      be able to deliver this FIN after a timeout event. Note that this
      FIN being delivered by a retransmit, it also carries a Push flag
      given our current implementation.
      
      By checking sk_under_memory_pressure(), we anticipate that cooking
      many FIN packets might deplete tcp memory.
      
      In the case we could not allocate a packet, even with __GFP_WAIT
      allocation, then not sending a FIN seems quite reasonable if it allows
      to get rid of this socket, free memory, and not block the process from
      eventually doing other useful work.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      b5635e45
    • Eric Dumazet's avatar
      tcp: fix possible deadlock in tcp_send_fin() · 2732443a
      Eric Dumazet authored
      [ Upstream commit d83769a5 ]
      
      Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
      case a huge process is killed by OOM, and tcp_mem[2] is hit.
      
      To be able to free memory we need to make progress, so this
      patch allows FIN packets to not care about tcp_mem[2], if
      skb allocation succeeded.
      
      In a follow-up patch, we might abort tcp_send_fin() infinite loop
      in case TIF_MEMDIE is set on this thread, as memory allocator
      did its best getting extra memory already.
      
      This patch reverts d22e1537 ("tcp: fix tcp fin memory accounting")
      
      Fixes: d22e1537 ("tcp: fix tcp fin memory accounting")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      2732443a
    • Tom Herbert's avatar
      ppp: call skb_checksum_complete_unset in ppp_receive_frame · 93cc4421
      Tom Herbert authored
      [ Upstream commit 3dfb0534 ]
      
      Call checksum_complete_unset in PPP receive to discard checksum-complete
      value. PPP does not pull checksum for headers and also modifies packet
      as in VJ compression.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      93cc4421
    • Tom Herbert's avatar
      net: add skb_checksum_complete_unset · dc707148
      Tom Herbert authored
      [ Upstream commit 4e18b9ad ]
      
      This function changes ip_summed to CHECKSUM_NONE if CHECKSUM_COMPLETE
      is set. This is called to discard checksum-complete when packet
      is being modified and checksum is not pulled for headers in a layer.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      dc707148
    • Sebastian Pöhn's avatar
      ip_forward: Drop frames with attached skb->sk · 7aca2472
      Sebastian Pöhn authored
      [ Upstream commit 2ab95749 ]
      
      Initial discussion was:
      [FYI] xfrm: Don't lookup sk_policy for timewait sockets
      
      Forwarded frames should not have a socket attached. Especially
      tw sockets will lead to panics later-on in the stack.
      
      This was observed with TPROXY assigning a tw socket and broken
      policy routing (misconfigured). As a result frame enters
      forwarding path instead of input. We cannot solve this in
      TPROXY as it cannot know that policy routing is broken.
      
      v2:
      Remove useless comment
      Signed-off-by: default avatarSebastian Poehn <sebastian.poehn@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      7aca2472
    • David S. Miller's avatar
      ipv4: Missing sk_nulls_node_init() in ping_unhash(). · e13f6f2b
      David S. Miller authored
      [ Upstream commit a134f083 ]
      
      If we don't do that, then the poison value is left in the ->pprev
      backlink.
      
      This can cause crashes if we do a disconnect, followed by a connect().
      Tested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: default avatarWen Xu <hotdog3645@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e13f6f2b
    • Ido Shamay's avatar
      net/mlx4_en: Schedule napi when RX buffers allocation fails · b64bab57
      Ido Shamay authored
      [ Upstream commit 07841f9d ]
      
      When system is out of memory, refilling of RX buffers fails while
      the driver continue to pass the received packets to the kernel stack.
      At some point, when all RX buffers deplete, driver may fall into a
      sleep, and not recover when memory for new RX buffers is once again
      availible. This is because hardware does not have valid descriptors,
      so no interrupt will be generated for the driver to return to work
      in napi context. Fix it by schedule the napi poll function from
      stats_task delayed workqueue, as long as the allocations fail.
      Signed-off-by: default avatarIdo Shamay <idos@mellanox.com>
      Signed-off-by: default avatarAmir Vadai <amirv@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      b64bab57
    • Benjamin Poirier's avatar
      mlx4: Fix tx ring affinity_mask creation · 09f2ece8
      Benjamin Poirier authored
      [ Upstream commit 42eab005 ]
      
      By default, the number of tx queues is limited by the number of online cpus
      in mlx4_en_get_profile(). However, this limit no longer holds after the
      ethtool .set_channels method has been called. In that situation, the driver
      may access invalid bits of certain cpumask variables when queue_index >=
      nr_cpu_ids.
      Signed-off-by: default avatarBenjamin Poirier <bpoirier@suse.de>
      Acked-by: default avatarIdo Shamay <idos@mellanox.com>
      Fixes: d03a68f8 ("net/mlx4_en: Configure the XPS queue mapping on driver load")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      09f2ece8
    • Christoffer Dall's avatar
      arm/arm64: KVM: Keep elrsr/aisr in sync with software model · 5c0ac4b5
      Christoffer Dall authored
      commit ae705930 upstream.
      
      There is an interesting bug in the vgic code, which manifests itself
      when the KVM run loop has a signal pending or needs a vmid generation
      rollover after having disabled interrupts but before actually switching
      to the guest.
      
      In this case, we flush the vgic as usual, but we sync back the vgic
      state and exit to userspace before entering the guest.  The consequence
      is that we will be syncing the list registers back to the software model
      using the GICH_ELRSR and GICH_EISR from the last execution of the guest,
      potentially overwriting a list register containing an interrupt.
      
      This showed up during migration testing where we would capture a state
      where the VM has masked the arch timer but there were no interrupts,
      resulting in a hung test.
      
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Reported-by: default avatarAlex Bennee <alex.bennee@linaro.org>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarAlex Bennée <alex.bennee@linaro.org>
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5c0ac4b5
    • Marc Zyngier's avatar
      arm64: KVM: Do not use pgd_index to index stage-2 pgd · 74e0fede
      Marc Zyngier authored
      commit 04b8dc85 upstream.
      
      The kernel's pgd_index macro is designed to index a normal, page
      sized array. KVM is a bit diffferent, as we can use concatenated
      pages to have a bigger address space (for example 40bit IPA with
      4kB pages gives us an 8kB PGD.
      
      In the above case, the use of pgd_index will always return an index
      inside the first 4kB, which makes a guest that has memory above
      0x8000000000 rather unhappy, as it spins forever in a page fault,
      whist the host happilly corrupts the lower pgd.
      
      The obvious fix is to get our own kvm_pgd_index that does the right
      thing(tm).
      
      Tested on X-Gene with a hacked kvmtool that put memory at a stupidly
      high address.
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      74e0fede
    • Marc Zyngier's avatar
      arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting · 354883e3
      Marc Zyngier authored
      commit a987370f upstream.
      
      We're using __get_free_pages with to allocate the guest's stage-2
      PGD. The standard behaviour of this function is to return a set of
      pages where only the head page has a valid refcount.
      
      This behaviour gets us into trouble when we're trying to increment
      the refcount on a non-head page:
      
      page:ffff7c00cfb693c0 count:0 mapcount:0 mapping:          (null) index:0x0
      flags: 0x4000000000000000()
      page dumped because: VM_BUG_ON_PAGE((*({ __attribute__((unused)) typeof((&page->_count)->counter) __var = ( typeof((&page->_count)->counter)) 0; (volatile typeof((&page->_count)->counter) *)&((&page->_count)->counter); })) <= 0)
      BUG: failure at include/linux/mm.h:548/get_page()!
      Kernel panic - not syncing: BUG!
      CPU: 1 PID: 1695 Comm: kvm-vcpu-0 Not tainted 4.0.0-rc1+ #3825
      Hardware name: APM X-Gene Mustang board (DT)
      Call trace:
      [<ffff80000008a09c>] dump_backtrace+0x0/0x13c
      [<ffff80000008a1e8>] show_stack+0x10/0x1c
      [<ffff800000691da8>] dump_stack+0x74/0x94
      [<ffff800000690d78>] panic+0x100/0x240
      [<ffff8000000a0bc4>] stage2_get_pmd+0x17c/0x2bc
      [<ffff8000000a1dc4>] kvm_handle_guest_abort+0x4b4/0x6b0
      [<ffff8000000a420c>] handle_exit+0x58/0x180
      [<ffff80000009e7a4>] kvm_arch_vcpu_ioctl_run+0x114/0x45c
      [<ffff800000099df4>] kvm_vcpu_ioctl+0x2e0/0x754
      [<ffff8000001c0a18>] do_vfs_ioctl+0x424/0x5c8
      [<ffff8000001c0bfc>] SyS_ioctl+0x40/0x78
      CPU0: stopping
      
      A possible approach for this is to split the compound page using
      split_page() at allocation time, and change the teardown path to
      free one page at a time.  It turns out that alloc_pages_exact() and
      free_pages_exact() does exactly that.
      
      While we're at it, the PGD allocation code is reworked to reduce
      duplication.
      
      This has been tested on an X-Gene platform with a 4kB/48bit-VA host
      kernel, and kvmtool hacked to place memory in the second page of
      the hardware PGD (PUD for the host kernel). Also regression-tested
      on a Cubietruck (Cortex-A7).
      
       [ Reworked to use alloc_pages_exact() and free_pages_exact() and to
         return pointers directly instead of by reference as arguments
          - Christoffer ]
      Reported-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      354883e3
    • Jan Kiszka's avatar
      ARM: KVM: Fix size check in __coherent_cache_guest_page · f10b9c8a
      Jan Kiszka authored
      commit a050dfb2 upstream.
      
      The check is supposed to catch page-unaligned sizes, not the inverse.
      Signed-off-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      f10b9c8a
    • Marc Zyngier's avatar
      arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault · a49ecf87
      Marc Zyngier authored
      commit 0d3e4d4f upstream.
      
      When handling a fault in stage-2, we need to resync I$ and D$, just
      to be sure we don't leave any old cache line behind.
      
      That's very good, except that we do so using the *user* address.
      Under heavy load (swapping like crazy), we may end up in a situation
      where the page gets mapped in stage-2 while being unmapped from
      userspace by another CPU.
      
      At that point, the DC/IC instructions can generate a fault, which
      we handle with kvm->mmu_lock held. The box quickly deadlocks, user
      is unhappy.
      
      Instead, perform this invalidation through the kernel mapping,
      which is guaranteed to be present. The box is much happier, and so
      am I.
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      a49ecf87
    • Marc Zyngier's avatar
      arm/arm64: KVM: Invalidate data cache on unmap · a412dc06
      Marc Zyngier authored
      commit 363ef89f upstream.
      
      Let's assume a guest has created an uncached mapping, and written
      to that page. Let's also assume that the host uses a cache-coherent
      IO subsystem. Let's finally assume that the host is under memory
      pressure and starts to swap things out.
      
      Before this "uncached" page is evicted, we need to make sure
      we invalidate potential speculated, clean cache lines that are
      sitting there, or the IO subsystem is going to swap out the
      cached view, loosing the data that has been written directly
      into memory.
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      a412dc06