1. 15 Jul, 2017 25 commits
  2. 05 Jul, 2017 15 commits
    • Linux 4.4.76 · 4282d395
      Greg Kroah-Hartman authored
    • KVM: nVMX: Fix exception injection · be8c39b4
      Wanpeng Li authored
      commit d4912215 upstream.
      
       WARNING: CPU: 3 PID: 2840 at arch/x86/kvm/vmx.c:10966 nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
       CPU: 3 PID: 2840 Comm: qemu-system-x86 Tainted: G           OE   4.12.0-rc3+ #23
       RIP: 0010:nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
       Call Trace:
        ? kvm_check_async_pf_completion+0xef/0x120 [kvm]
        ? rcu_read_lock_sched_held+0x79/0x80
        vmx_queue_exception+0x104/0x160 [kvm_intel]
        ? vmx_queue_exception+0x104/0x160 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0x1171/0x1ce0 [kvm]
        ? kvm_arch_vcpu_load+0x47/0x240 [kvm]
        ? kvm_arch_vcpu_load+0x62/0x240 [kvm]
        kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? __fget+0xf3/0x210
        do_vfs_ioctl+0xa4/0x700
        ? __fget+0x114/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x81/0x220
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This is triggered occasionally by running both win7 and win2016 as L2 guests,
      with EPT disabled on both L1 and L2. It cannot be reproduced easily.
      
      Commit 0b6ac343 (KVM: nVMX: Correct handling of exception injection) mentioned
      that "KVM wants to inject page-faults which it got to the guest. This function
      assumes it is called with the exit reason in vmcs02 being a #PF exception".
      Commit e011c663 (KVM: nVMX: Check all exceptions for intercept during delivery to
      L2) extended this to check all exceptions for intercept during delivery to L2.
      However, there is no guarantee that the current exit reason is an exception: an
      external interrupt can occur on the host (for example a host timer interrupt that
      should not be injected into the guest) while an exception is queued elsewhere. In
      that case nested_vmx_check_exception() is still called, the vmexit emulation code
      tries to emulate the "Acknowledge interrupt on exit" behavior, and the warning
      above is triggered.
      
      Reusing the exit reason from the L2->L0 vmexit is wrong in this case,
      the reason must always be EXCEPTION_NMI when injecting an exception into
      L1 as a nested vmexit.
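
      A tiny standalone C model of that failure mode (illustrative only, not the KVM
      code; names and constants below are made up): the reason reported to L1 must be
      stated explicitly at injection time rather than reused from whatever the last
      L2->L0 exit happened to be.

        #include <stdio.h>

        enum exit_reason { EXIT_EXCEPTION = 0, EXIT_EXTERNAL_INTERRUPT = 1 };

        /* Set by the most recent L2->L0 exit; may be an external interrupt. */
        static enum exit_reason last_exit_reason = EXIT_EXTERNAL_INTERRUPT;

        static enum exit_reason reason_buggy(void)
        {
                return last_exit_reason;   /* reuses a stale, unrelated reason */
        }

        static enum exit_reason reason_fixed(void)
        {
                return EXIT_EXCEPTION;     /* an injected exception is always an exception exit */
        }

        int main(void)
        {
                printf("buggy: %d, fixed: %d\n", reason_buggy(), reason_fixed());
                return 0;
        }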
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Fixes: e011c663 ("KVM: nVMX: Check all exceptions for intercept during delivery to L2")
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: zero base3 of unusable segments · 77d977dd
      Radim Krčmář authored
      commit f0367ee1 upstream.
      
      A static checker noticed that base3 could be used uninitialized if the
      segment was not present (unusable).  Random stack values probably would
      not pass VMCS entry checks.
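
      A minimal standalone illustration of the uninitialized-field hazard (plain C,
      not the KVM emulator; the struct below is made up): zeroing the output up front
      guarantees base3 is deterministic even on the early-return path.

        #include <stdio.h>
        #include <string.h>

        struct seg_demo { unsigned long base; unsigned int base3; int unusable; };

        static void get_segment_demo(struct seg_demo *out, int unusable)
        {
                memset(out, 0, sizeof(*out));   /* zero everything up front... */
                out->unusable = unusable;
                if (unusable)
                        return;                 /* ...so an early return leaves no stack garbage */
                out->base = 0x1000;
                out->base3 = 0;                 /* upper base bits of a 64-bit segment */
        }

        int main(void)
        {
                struct seg_demo s;
                get_segment_demo(&s, 1);
                printf("base3 for unusable segment: %u\n", s.base3);  /* deterministically 0 */
                return 0;
        }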
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: 1aa36616 ("KVM: x86 emulator: consolidate segment accessors")
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh() · 3b1609f6
      Radim Krčmář authored
      commit 34b0dadb upstream.
      
      Static analysis noticed that pmu->nr_arch_gp_counters can be 32
      (INTEL_PMC_MAX_GENERIC) and therefore cannot be used to shift 'int'.
      
      I didn't add BUILD_BUG_ON for it as we have a better checker.
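
      For reference, a standalone sketch of the shift hazard (generic C, not the KVM
      code): shifting a 32-bit int by 32 bits is undefined behaviour, so a mask
      covering up to 32 counters has to be built in a 64-bit type.

        #include <stdio.h>
        #include <stdint.h>

        static uint64_t counter_mask(unsigned int nr_counters)
        {
                /* Safe for nr_counters == 32: the shift happens on a 64-bit value. */
                return nr_counters >= 64 ? ~0ULL : (1ULL << nr_counters) - 1;
        }

        int main(void)
        {
                /* int bad = (1 << 32) - 1;   undefined behaviour on a 32-bit int */
                printf("mask for 32 counters: 0x%llx\n",
                       (unsigned long long)counter_mask(32));
                return 0;
        }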
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: 25462f7f ("KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch")
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: fix emulation of RSM and IRET instructions · b9b3eb5c
      Ladi Prosek authored
      commit 6ed071f0 upstream.
      
      On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
      on hflags is reverted later on in x86_emulate_instruction where hflags are
      overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
      as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
      
      Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
      an instruction is emulated, this commit deletes emul_flags altogether and
      makes the emulator access vcpu->arch.hflags using two new accessors. This
      way all changes, on the emulator side as well as in functions called from
      the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
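
      A standalone sketch of the accessor approach described above (toy C, not the
      KVM emulator interface; all names are illustrative): instead of copying a flags
      word into the emulation context and writing the copy back afterwards, which
      discards changes made elsewhere in the meantime, the context reads and writes
      the single authoritative copy through callbacks.

        #include <stdio.h>

        struct vcpu_demo { unsigned int hflags; };

        struct emul_ctxt_demo {
                struct vcpu_demo *vcpu;
                unsigned int (*get_hflags)(struct vcpu_demo *);
                void (*set_hflags)(struct vcpu_demo *, unsigned int);
        };

        static unsigned int get_hflags_demo(struct vcpu_demo *v) { return v->hflags; }
        static void set_hflags_demo(struct vcpu_demo *v, unsigned int f) { v->hflags = f; }

        #define HF_NMI_MASK_DEMO 0x1u

        /* stands in for set_nmi_mask() reached from the emulated RSM path */
        static void unmask_nmi_demo(struct emul_ctxt_demo *ctxt)
        {
                unsigned int f = ctxt->get_hflags(ctxt->vcpu);

                ctxt->set_hflags(ctxt->vcpu, f & ~HF_NMI_MASK_DEMO);
        }

        int main(void)
        {
                struct vcpu_demo vcpu = { .hflags = HF_NMI_MASK_DEMO };
                struct emul_ctxt_demo ctxt = { &vcpu, get_hflags_demo, set_hflags_demo };

                unmask_nmi_demo(&ctxt);
                /* No later "hflags = stale copy" step exists to revert the change. */
                printf("hflags after RSM emulation: 0x%x\n", vcpu.hflags);
                return 0;
        }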
      
      More details on the bug and its manifestation with Windows and OVMF:
      
        It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
        I believe that the SMM part explains why we started seeing this only with
        OVMF.
      
        KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
        the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
        later on in x86_emulate_instruction we overwrite arch.hflags with
        ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
        The AMD-specific hflag of interest here is HF_NMI_MASK.
      
        When rebooting the system, Windows sends an NMI IPI to all but the current
        cpu to shut them down. Only after all of them are parked in HLT will the
        initiating cpu finish the restart. If NMI is masked, other cpus never get
        the memo and the initiating cpu spins forever, waiting for
        hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
      
      Fixes: a584539b ("KVM: x86: pass the whole hflags field to emulator and back")
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpufreq: s3c2416: double free on driver init error path · 3491a0b5
      Dan Carpenter authored
      commit a69261e4 upstream.
      
      The "goto err_armclk;" error path already does a clk_put(s3c_freq->hclk);
      so this is a double free.
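
      A generic standalone sketch of the unwind rule at issue (plain C, not the
      s3c2416 driver; names are illustrative): each error label releases only what
      was acquired before the jump, so no resource is put twice.

        #include <stdio.h>
        #include <stdlib.h>

        struct clk_demo { const char *name; };

        static struct clk_demo *clk_get_demo(const char *name)
        {
                struct clk_demo *c = malloc(sizeof(*c));

                if (c)
                        c->name = name;
                return c;
        }

        static void clk_put_demo(struct clk_demo *c) { free(c); }

        static int init_demo(void)
        {
                struct clk_demo *hclk, *armclk;

                hclk = clk_get_demo("hclk");
                if (!hclk)
                        return -1;

                armclk = clk_get_demo("armclk");
                if (!armclk)
                        goto err_hclk;          /* hclk is put exactly once, below */

                clk_put_demo(armclk);
                clk_put_demo(hclk);
                return 0;

        err_hclk:
                clk_put_demo(hclk);             /* no other path may put it again */
                return -1;
        }

        int main(void) { return init_demo() ? 1 : 0; }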
      
      Fixes: 34ee5507 ([CPUFREQ] Add S3C2416/S3C2450 cpufreq driver)
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid() · aad7041e
      Pan Bian authored
      commit 73dbd4a4 upstream.
      
      In amd_iommu_bind_pasid(), control flow jumps to the label out_free when
      pasid_state->mm and mm are NULL, and mmput(mm) is then called.  mmput()
      dereferences mm without validating it, which results in a NULL pointer
      dereference.  This patch fixes the bug.
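
      A standalone sketch of the hazard (toy C, not the AMD IOMMU code): when
      acquisition fails and the handle is NULL, the error path must skip the release
      call rather than hand the NULL handle to it.

        #include <stdio.h>
        #include <stddef.h>

        struct mm_demo { int users; };

        static void mmput_demo(struct mm_demo *mm)
        {
                mm->users--;                    /* dereferences mm: a NULL here would crash */
        }

        static int bind_demo(struct mm_demo *mm)
        {
                if (!mm)
                        return -1;              /* nothing acquired, so nothing to put */

                /* ... bind work ... */
                mmput_demo(mm);                 /* only reached with a valid mm */
                return 0;
        }

        int main(void)
        {
                struct mm_demo mm = { .users = 1 };

                printf("bind(NULL) = %d, bind(&mm) = %d\n", bind_demo(NULL), bind_demo(&mm));
                return 0;
        }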
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Fixes: f0aac63b ('iommu/amd: Don't hold a reference to mm_struct')
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu: Handle default domain attach failure · 48952c6d
      Robin Murphy authored
      commit 797a8b4d upstream.
      
      We wouldn't normally expect ops->attach_dev() to fail, but on IOMMUs
      with limited hardware resources, or generally misconfigured systems,
      it is certainly possible. We report failure correctly from the external
      iommu_attach_device() interface, but do not do so in iommu_group_add()
      when attaching to the default domain. The result of failure there is
      that the device, group and domain all get left in a broken,
      part-configured state which leads to weird errors and misbehaviour down
      the line when IOMMU API calls sort-of-but-don't-quite work.
      
      Check the return value of __iommu_attach_device() on the default domain,
      and refactor the error handling paths to cope with its failure and clean
      up correctly in such cases.
      
      Fixes: e39cb8a3 ("iommu: Make sure a device is always attached to a domain")
      Reported-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu/vt-d: Don't over-free page table directories · 3de9630a
      David Dillow authored
      commit f7116e11 upstream.
      
      dma_pte_free_level() recurses down the IOMMU page tables and frees
      directory pages that are entirely contained in the given PFN range.
      Unfortunately, it incorrectly calculates the starting address covered
      by the PTE under consideration, which can lead to it clearing an entry
      that is still in use.
      
      This occurs if we have a scatterlist with an entry that has a length
      greater than 1026 MB and is aligned to 2 MB for both the IOMMU and
      physical addresses. For example, if __domain_mapping() is asked to map a
      two-entry scatterlist with 2 MB and 1028 MB segments to PFN 0xffff80000,
      and dma_pte_free_pagetable() is later asked to free the PFNs from
      0xffff80200 to 0xffffc05ff, it will also incorrectly clear the PFNs from
      0xffff80000 to 0xffff801ff because of this issue. The current code will
      set level_pfn to 0xffff80200, and 0xffff80200-0xffffc01ff fits inside
      the range being cleared. Properly setting the level_pfn for the current
      level under consideration catches that this PTE is outside of the range
      being cleared.
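
      The arithmetic can be checked with a small standalone program (plain C, not the
      VT-d code; it assumes 4 KiB pages and 9 bits of table index per level). With
      level_pfn derived from the entry's own alignment, the 1 GiB directory entry
      that still contains the in-use 2 MiB mapping at 0xffff80000 is seen to extend
      below the range being freed and is kept:

        #include <stdio.h>
        #include <inttypes.h>

        #define LEVEL_SHIFT(l)  (((l) - 1) * 9)             /* level 1 = 4 KiB, 2 = 2 MiB, 3 = 1 GiB */
        #define LEVEL_SIZE(l)   (1ULL << LEVEL_SHIFT(l))    /* entry size in 4 KiB PFNs */
        #define LEVEL_MASK(l)   (~(LEVEL_SIZE(l) - 1))

        int main(void)
        {
                uint64_t start_pfn = 0xffff80200ULL;        /* range being freed */
                uint64_t last_pfn  = 0xffffc05ffULL;
                uint64_t pfn       = 0xffff80200ULL;        /* cursor inside the 1 GiB directory */
                int level = 3;

                uint64_t level_pfn = pfn & LEVEL_MASK(level);            /* first PFN the entry covers */
                uint64_t level_end = level_pfn + LEVEL_SIZE(level) - 1;  /* last PFN the entry covers */

                printf("entry covers 0x%" PRIx64 "-0x%" PRIx64 ": %s\n",
                       level_pfn, level_end,
                       (level_pfn >= start_pfn && level_end <= last_pfn)
                               ? "entirely inside the range, free it"
                               : "extends outside the range, keep it");
                return 0;
        }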
      
      This patch also changes the value passed into dma_pte_free_level() when
      it recurses. This only affects the first PTE of the range being cleared,
      and is handled by the existing code that ensures we start our cursor no
      lower than start_pfn.
      
      This was found when using dma_map_sg() to map large chunks of contiguous
      memory, which immediately led to faults on the first access of the
      erroneously-deleted mappings.
      
      Fixes: 3269ee0b ("intel-iommu: Fix leaks in pagetable freeing")
      Reviewed-by: Benjamin Serebrin <serebrin@google.com>
      Signed-off-by: David Dillow <dillow@google.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ocfs2: o2hb: revert hb threshold to keep compatible · 404ef3b4
      Junxiao Bi authored
      commit 33496c3c upstream.
      
      Configfs is the interface for ocfs2-tools to pass configuration to the
      kernel, and $configfs_dir/cluster/$clustername/heartbeat/dead_threshold is
      the attribute used to configure the heartbeat dead threshold.  The kernel
      has a default value for it, but users can set O2CB_HEARTBEAT_THRESHOLD in
      /etc/sysconfig/o2cb to override it.
      
      Commit 45b99773 ("ocfs2/cluster: use per-attribute show and store
      methods") changed the heartbeat dead threshold attribute name while
      ocfs2-tools did not, so ocfs2-tools can no longer set this configurable
      and the default value is always used.  So revert it.
      
      Fixes: 45b99773 ("ocfs2/cluster: use per-attribute show and store methods")
      Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Acked-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/mm: Fix flush_tlb_page() on Xen · 5d650fce
      Andy Lutomirski authored
      commit dbd68d8e upstream.
      
      flush_tlb_page() passes a bogus range to flush_tlb_others() and
      expects the latter to fix it up.  native_flush_tlb_others() has the
      fixup but Xen's version doesn't.  Move the fixup to
      flush_tlb_others().
      
      AFAICS the only real effect is that, without this fix, Xen would
      flush everything instead of just the one page on remote vCPUs when
      flush_tlb_page() was called.
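
      A toy standalone sketch of moving the fixup into the common dispatch path
      (plain C, not the kernel's TLB code; what exactly constitutes the bogus range
      is simplified here): every backend then receives an already-sanitized range,
      so a backend without its own fixup no longer falls back to flushing everything.

        #include <stdio.h>

        #define PAGE_SIZE_DEMO 4096UL
        #define TLB_FLUSH_ALL_DEMO (~0UL)

        /* Backend without its own fixup (stands in for the Xen path). */
        static void backend_flush(unsigned long start, unsigned long end)
        {
                if (end == TLB_FLUSH_ALL_DEMO)
                        printf("flush everything\n");
                else
                        printf("flush 0x%lx-0x%lx\n", start, end);
        }

        /* Common dispatcher: sanitize the range once, before any backend sees it. */
        static void flush_others(unsigned long start, unsigned long end)
        {
                if (end != TLB_FLUSH_ALL_DEMO && end <= start)
                        end = start + PAGE_SIZE_DEMO;   /* treat it as a single-page request */
                backend_flush(start, end);
        }

        int main(void)
        {
                flush_others(0x1000, 0x1000);   /* degenerate range standing in for the bogus one */
                return 0;
        }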
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: e7b52ffd ("x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range")
      Link: http://lkml.kernel.org/r/10ed0e4dfea64daef10b87fb85df1746999b4dba.1492844372.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space · 6fb3b322
      Joerg Roedel authored
      commit 5ed386ec upstream.
      
      When this function fails it just sends a SIGSEGV signal to
      user-space using force_sig(). This signal is missing
      essential information about the cause, e.g. the trap_nr or
      an error code.
      
      Fix this by propagating the error to the only caller of
      mpx_handle_bd_fault(), do_bounds(), which sends the correct
      SIGSEGV signal to the process.
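
      A standalone model of the reporting change (toy C, not the MPX code; the helper
      and values are illustrative): the low-level helper now propagates failure
      instead of raising a bare SIGSEGV itself, and the single caller raises the
      signal together with the trap number and error code.

        #include <stdio.h>

        struct siginfo_demo { int signo; int trap_nr; int error_code; };

        static int handle_bt_fault_demo(int fail)
        {
                /* Old behaviour: force_sig(SIGSEGV) here, losing trap_nr/error_code. */
                return fail ? -1 : 0;           /* new behaviour: report failure upward */
        }

        static void do_bounds_demo(int fail)
        {
                if (handle_bt_fault_demo(fail)) {
                        /* 11 = SIGSEGV, 5 = #BR vector; filled in by the one caller */
                        struct siginfo_demo si = { .signo = 11, .trap_nr = 5, .error_code = 0 };

                        printf("SIGSEGV with trap_nr=%d error_code=%d\n",
                               si.trap_nr, si.error_code);
                        return;
                }
                printf("bounds table fault handled\n");
        }

        int main(void)
        {
                do_bounds_demo(0);
                do_bounds_demo(1);
                return 0;
        }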
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: fe3d197f ('x86, mpx: On-demand kernel allocation of bounds tables')
      Link: http://lkml.kernel.org/r/1491488362-27198-1-git-send-email-joro@8bytes.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM: 8685/1: ensure memblock-limit is pmd-aligned · 7cd8c490
      Doug Berger authored
      commit 9e25ebfe upstream.
      
      The pmd containing memblock_limit is cleared by prepare_page_table(),
      which creates the opportunity for early_alloc() to allocate unmapped
      memory if memblock_limit is not pmd-aligned, causing a boot-time hang.
      
      Commit 965278dc ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
      attempted to resolve this problem, but there is a path through the
      adjust_lowmem_bounds() routine where if all memory regions start and
      end on pmd-aligned addresses the memblock_limit will be set to
      arm_lowmem_limit.
      
      Since arm_lowmem_limit can be affected by the vmalloc early parameter,
      the value of arm_lowmem_limit may not be pmd-aligned. This commit
      corrects this oversight such that memblock_limit is always rounded
      down to pmd-alignment.
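
      The rounding itself is simple; a standalone check (plain C, not the ARM mm
      code, assuming 2 MiB sections) shows a non-section-aligned lowmem limit being
      rounded down to a pmd boundary:

        #include <stdio.h>

        #define PMD_SIZE_DEMO (2UL * 1024 * 1024)
        #define ROUND_DOWN(x, a) ((x) & ~((a) - 1))

        int main(void)
        {
                unsigned long arm_lowmem_limit = 0x6f800000UL + 0x12345UL;  /* not pmd-aligned */
                unsigned long memblock_limit = ROUND_DOWN(arm_lowmem_limit, PMD_SIZE_DEMO);

                printf("lowmem limit 0x%lx -> memblock limit 0x%lx\n",
                       arm_lowmem_limit, memblock_limit);
                return 0;
        }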
      
      Fixes: 965278dc ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Suggested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation · d4960d58
      Lorenzo Pieralisi authored
      commit cb7cf772 upstream.
      
      The BAD_MADT_GICC_ENTRY() macro checks whether a GICC MADT entry passes
      muster from an ACPI specification standpoint. The current macro detects the
      MADT GICC entry length through the ACPI firmware version (it changed from 76
      to 80 bytes in the transition from the ACPI 5.1 to the ACPI 6.0 specification),
      but it erroneously always uses the length of the latest ACPICA struct (ie
      struct acpi_madt_generic_interrupt, which is 80 bytes long) to check whether
      the current GICC entry record exceeds the MADT table end in memory, as
      defined by the MADT table header itself. Depending on the ACPI firmware
      version and on how the MADT entries are laid out in memory, this can wrongly
      flag valid entries as bad: on ACPI 5.1 firmware, MADT GICC entries are 76
      bytes long, so adding 80 to a GICC entry start address may well yield an
      address past the actual MADT end, making the check fail for a valid entry.
      
      Fix the BAD_MADT_GICC_ENTRY() macro by reshuffling the condition checks
      and update them to always use the firmware version specific MADT GICC
      entry length in order to carry out boundary checks.
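
      A standalone sketch of the corrected bounds check (plain C, not the kernel
      macro; the table layout is made up, the 76/80 byte lengths are the GICC entry
      sizes quoted above): the end-of-table comparison must use the firmware-specific
      entry length, not the size of the newest ACPICA struct.

        #include <stdio.h>
        #include <stdint.h>

        #define GICC_LENGTH_ACPI_51 76   /* ACPI 5.1 GICC entry */
        #define GICC_LENGTH_ACPI_60 80   /* ACPI 6.0 GICC entry (latest ACPICA struct) */

        static int gicc_entry_bad(uintptr_t entry, unsigned int fw_entry_len, uintptr_t table_end)
        {
                /* Bad if the firmware-sized record would run past the table end. */
                return entry + fw_entry_len > table_end;
        }

        int main(void)
        {
                uintptr_t table_end  = 1000;
                uintptr_t last_entry = 924;    /* a valid, final 76-byte entry: 924 + 76 == 1000 */

                printf("check with firmware length (76): %s\n",
                       gicc_entry_bad(last_entry, GICC_LENGTH_ACPI_51, table_end) ? "bad" : "ok");
                printf("check with struct size (80):     %s\n",
                       gicc_entry_bad(last_entry, GICC_LENGTH_ACPI_60, table_end) ? "bad" : "ok");
                return 0;
        }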
      
      Fixes: b6cfb277 ("ACPI / ARM64: add BAD_MADT_GICC_ENTRY() macro")
      Reported-by: Julien Grall <julien.grall@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Julien Grall <julien.grall@arm.com>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Al Stone <ahs3@redhat.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting · 6ca11db5
      Matt Fleming authored
      commit 6e5f32f7 upstream.
      
      If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
      the pending sample window time on exit, setting the next update not
      one window into the future, but two.
      
      This situation on exiting NO_HZ is described by:
      
        this_rq->calc_load_update < jiffies < calc_load_update
      
      In this scenario, what we should be doing is:
      
        this_rq->calc_load_update = calc_load_update		     [ next window ]
      
      But what we actually do is:
      
        this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]
      
      This has the effect of delaying load average updates for potentially
      up to ~9 seconds.
      
      This can result in huge spikes in the load average values due to
      per-cpu uninterruptible task counts being out of sync when accumulated
      across all CPUs.
      
      It's safe to update the per-cpu active count if we wake between sample
      windows because any load that we left in 'calc_load_idle' will have
      been zero'd when the idle load was folded in calc_global_load().
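
      The window arithmetic can be illustrated with a standalone sketch (plain C, not
      the scheduler code; LOAD_FREQ below stands in for the 5 s sample interval in
      jiffies):

        #include <stdio.h>

        #define LOAD_FREQ_DEMO 5000UL   /* one sample window, in "jiffies" */

        int main(void)
        {
                unsigned long calc_load_update = 20000;  /* global: start of the next window */
                unsigned long rq_update        = 15000;  /* this_rq: still points at the old window */
                unsigned long jiffies          = 17500;  /* woke from NO_HZ between the two */

                if (rq_update < jiffies && jiffies < calc_load_update) {
                        unsigned long buggy = calc_load_update + LOAD_FREQ_DEMO;  /* next+1 window */
                        unsigned long fixed = calc_load_update;                   /* next window   */

                        printf("buggy next update at %lu, fixed at %lu\n", buggy, fixed);
                }
                return 0;
        }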
      
      This issue is easy to reproduce before,
      
        commit 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      
      just by forking short-lived process pipelines built from ps(1) and
      grep(1) in a loop. I'm unable to reproduce the spikes after that
      commit, but the bug still seems to be present from code review.
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: commit 5167e8d5 ("sched/nohz: Rewrite and fix load-avg computation -- again")
      Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>