1. 18 Jun, 2019 11 commits
  2. 13 Jun, 2019 1 commit
    • Paolo Bonzini's avatar
      KVM: x86: clean up conditions for asynchronous page fault handling · 1dfdb45e
      Paolo Bonzini authored
      Even when asynchronous page fault is disabled, KVM does not want to pause
      the host if a guest triggers a page fault; instead it will put it into
      an artificial HLT state that allows running other host processes while
      allowing interrupt delivery into the guest.
      
      However, the way this feature is triggered is a bit confusing.
      First, it is not used for page faults while a nested guest is
      running: but this is not an issue since the artificial halt
      is completely invisible to the guest, either L1 or L2.  Second,
      it is used even if kvm_halt_in_guest() returns true; in this case,
      the guest probably should not pay the additional latency cost of the
      artificial halt, and thus we should handle the page fault in a
      completely synchronous way.
      
      By introducing a new function kvm_can_deliver_async_pf, this patch
      commonizes the code that chooses whether to deliver an async page fault
      (kvm_arch_async_page_not_present) and the code that chooses whether a
      page fault should be handled synchronously (kvm_can_do_async_pf).
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1dfdb45e
  3. 05 Jun, 2019 7 commits
  4. 04 Jun, 2019 13 commits
    • Andrew Jones's avatar
      kvm: selftests: ucall improvements · 2c7c5d3d
      Andrew Jones authored
      Make sure we complete the I/O after determining we have a ucall,
      which is I/O. Also allow the *uc parameter to optionally be NULL.
      It's quite possible that a test case will only care about the
      return value, like for example when looping on a check for
      UCALL_DONE.
      Signed-off-by: default avatarAndrew Jones <drjones@redhat.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2c7c5d3d
    • Wanpeng Li's avatar
      KVM: X86: Emulate MSR_IA32_MISC_ENABLE MWAIT bit · 511a8556
      Wanpeng Li authored
      MSR IA32_MISC_ENABLE bit 18, according to SDM:
      
      | When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0).
      | This indicates that MONITOR/MWAIT are not supported.
      |
      | Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0.
      |
      | When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1).
      
      The CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit,
      CPUID.01H:ECX[bit 3] is a better guard than kvm_mwait_in_guest().
      kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not its
      guest visibility.
      
      This patch implements toggling of the CPUID bit based on guest writes
      to the MSR.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      [Fixes for backwards compatibility - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      511a8556
    • Wanpeng Li's avatar
      KVM: X86: Provide a capability to disable cstate msr read intercepts · b5170063
      Wanpeng Li authored
      Allow guest reads CORE cstate when exposing host CPU power management capabilities
      to the guest. PKG cstate is restricted to avoid a guest to get the whole package
      information in multi-tenant scenario.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b5170063
    • Wanpeng Li's avatar
      KVM: Documentation: Add disable pause exits to KVM_CAP_X86_DISABLE_EXITS · 8ffdaa7f
      Wanpeng Li authored
      Commit b31c114b (KVM: X86: Provide a capability to disable PAUSE intercepts)
      forgot to add the KVM_X86_DISABLE_EXITS_PAUSE into api doc. This patch adds
      it.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8ffdaa7f
    • Xiaoyao Li's avatar
      kvm: x86: refine kvm_get_arch_capabilities() · 4d22c17c
      Xiaoyao Li authored
      1. Using X86_FEATURE_ARCH_CAPABILITIES to enumerate the existence of
      MSR_IA32_ARCH_CAPABILITIES to avoid using rdmsrl_safe().
      
      2. Since kvm_get_arch_capabilities() is only used in this file, making
      it static.
      Signed-off-by: default avatarXiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4d22c17c
    • Sean Christopherson's avatar
      KVM: Directly return result from kvm_arch_check_processor_compat() · f257d6dc
      Sean Christopherson authored
      Add a wrapper to invoke kvm_arch_check_processor_compat() so that the
      boilerplate ugliness of checking virtualization support on all CPUs is
      hidden from the arch specific code.  x86's implementation in particular
      is quite heinous, as it unnecessarily propagates the out-param pattern
      into kvm_x86_ops.
      
      While the x86 specific issue could be resolved solely by changing
      kvm_x86_ops, make the change for all architectures as returning a value
      directly is prettier and technically more robust, e.g. s390 doesn't set
      the out param, which could lead to subtle breakage in the (highly
      unlikely) scenario where the out-param was not pre-initialized by the
      caller.
      
      Opportunistically annotate svm_check_processor_compat() with __init.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: default avatarCornelia Huck <cohuck@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f257d6dc
    • Suthikulpanit, Suravee's avatar
      kvm: svm/avic: Do not send AVIC doorbell to self · 0532dd52
      Suthikulpanit, Suravee authored
      AVIC doorbell is used to notify a running vCPU that interrupts
      has been injected into the vCPU AVIC backing page. Current logic
      checks only if a VCPU is running before sending a doorbell.
      However, the doorbell is not necessary if the destination
      CPU is itself.
      
      Add logic to check currently running CPU before sending doorbell.
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Reviewed-by: default avatarAlexander Graf <graf@amazon.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0532dd52
    • Wanpeng Li's avatar
      KVM: LAPIC: Optimize timer latency further · b6c4bc65
      Wanpeng Li authored
      Advance lapic timer tries to hidden the hypervisor overhead between the
      host emulated timer fires and the guest awares the timer is fired. However,
      it just hidden the time between apic_timer_fn/handle_preemption_timer ->
      wait_lapic_expire, instead of the real position of vmentry which is
      mentioned in the orignial commit d0659d94 ("KVM: x86: add option to
      advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
      the end of wait_lapic_expire and before world switch on my haswell desktop.
      
      This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
      it takes the real overhead time between apic_timer_fn/handle_preemption_timer
      and before world switch into consideration when adaptively tuning timer
      advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
      on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b6c4bc65
    • Wanpeng Li's avatar
      KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit · ec0671d5
      Wanpeng Li authored
      wait_lapic_expire() call was moved above guest_enter_irqoff() because of
      its tracepoint, which violated the RCU extended quiescent state invoked
      by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
      below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
      VM-Enter, but trace it after VM-Exit. This can help us to move
      wait_lapic_expire() just before vmentry in the later patch.
      
      [1] Commit 8b89fe1f ("kvm: x86: move tracepoints outside extended quiescent state")
      [2] https://patchwork.kernel.org/patch/7821111/
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Suggested-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      [Track whether wait_lapic_expire was called, and do not invoke the tracepoint
       if not. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ec0671d5
    • Wanpeng Li's avatar
      KVM: LAPIC: Extract adaptive tune timer advancement logic · 84ea3aca
      Wanpeng Li authored
      Extract adaptive tune timer advancement logic to a single function.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      [Rename new function. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      84ea3aca
    • Vitaly Kuznetsov's avatar
      KVM/nSVM: properly map nested VMCB · 8f38302c
      Vitaly Kuznetsov authored
      Commit 8c5fbf1a ("KVM/nSVM: Use the new mapping API for mapping guest
      memory") broke nested SVM completely: kvm_vcpu_map()'s second parameter is
      GFN so vmcb_gpa needs to be converted with gpa_to_gfn(), not the other way
      around.
      
      Fixes: 8c5fbf1a ("KVM/nSVM: Use the new mapping API for mapping guest memory")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8f38302c
    • Kai Huang's avatar
      kvm: x86: Fix reserved bits related calculation errors caused by MKTME · f3ecb59d
      Kai Huang authored
      Intel MKTME repurposes several high bits of physical address as 'keyID'
      for memory encryption thus effectively reduces platform's maximum
      physical address bits. Exactly how many bits are reduced is configured
      by BIOS. To honor such HW behavior, the repurposed bits are reduced from
      cpuinfo_x86->x86_phys_bits when MKTME is detected in CPU detection.
      Similarly, AMD SME/SEV also reduces physical address bits for memory
      encryption, and cpuinfo->x86_phys_bits is reduced too when SME/SEV is
      detected, so for both MKTME and SME/SEV, boot_cpu_data.x86_phys_bits
      doesn't hold physical address bits reported by CPUID anymore.
      
      Currently KVM treats bits from boot_cpu_data.x86_phys_bits to 51 as
      reserved bits, but it's not true anymore for MKTME, since MKTME treats
      those reduced bits as 'keyID', but not reserved bits. Therefore
      boot_cpu_data.x86_phys_bits cannot be used to calculate reserved bits
      anymore, although we can still use it for AMD SME/SEV since SME/SEV
      treats the reduced bits differently -- they are treated as reserved
      bits, the same as other reserved bits in page table entity [1].
      
      Fix by introducing a new 'shadow_phys_bits' variable in KVM x86 MMU code
      to store the effective physical bits w/o reserved bits -- for MKTME,
      it equals to physical address reported by CPUID, and for SME/SEV, it is
      boot_cpu_data.x86_phys_bits.
      
      Note that for the physical address bits reported to guest should remain
      unchanged -- KVM should report physical address reported by CPUID to
      guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME,
      there's no harm if guest sets up 'keyID' bits in guest page table (since
      MKTME only works at physical address level), and KVM doesn't even expose
      MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it
      should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore
      KVM should still reports physcial address reported by CPUID to guest.
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarKai Huang <kai.huang@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f3ecb59d
    • Kai Huang's avatar
      kvm: x86: Move kvm_set_mmio_spte_mask() from x86.c to mmu.c · 7b6f8a06
      Kai Huang authored
      As a prerequisite to fix several SPTE reserved bits related calculation
      errors caused by MKTME, which requires kvm_set_mmio_spte_mask() to use
      local static variable defined in mmu.c.
      
      Also move call site of kvm_set_mmio_spte_mask() from kvm_arch_init() to
      kvm_mmu_module_init() so that kvm_set_mmio_spte_mask() can be static.
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarKai Huang <kai.huang@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7b6f8a06
  5. 31 May, 2019 2 commits
  6. 30 May, 2019 6 commits
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry() · d724c9e5
      Suraj Jitindar Singh authored
      The sprgs are a set of 4 general purpose sprs provided for software use.
      SPRG3 is special in that it can also be read from userspace. Thus it is
      used on linux to store the cpu and numa id of the process to speed up
      syscall access to this information.
      
      This register is overwritten with the guest value on kvm guest entry,
      and so needs to be restored on exit again. Thus restore the value on
      the guest exit path in kvmhv_p9_guest_entry().
      
      Cc: stable@vger.kernel.org # v4.20+
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      d724c9e5
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9 · 1b28d553
      Paul Mackerras authored
      Commit 3309bec8 ("KVM: PPC: Book3S HV: Fix lockdep warning when
      entering the guest") moved calls to trace_hardirqs_{on,off} in the
      entry path used for HPT guests.  Similar code exists in the new
      streamlined entry path used for radix guests on POWER9.  This makes
      the same change there, so as to avoid lockdep warnings such as this:
      
      [  228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      [  228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270
      [  228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat
      +xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter
      +ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf
      +uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs
      +zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor
      +raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea
      +sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core
      [  228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42
      [  228.686911] NIP:  c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20
      [  228.686963] REGS: c000200cdb50f570 TRAP: 0700   Not tainted  (5.2.0-rc1-xive+)
      [  228.687001] MSR:  9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 48222222  XER: 20040000
      [  228.687060] CFAR: c000000000116db0 IRQMASK: 1
      [  228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e
      [  228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f
      [  228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000
      [  228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000
      [  228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900
      [  228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074
      [  228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000
      [  228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0
      [  228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270
      [  228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270
      [  228.687466] Call Trace:
      [  228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable)
      [  228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0
      [  228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100
      [  228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340
      [  228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv]
      [  228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv]
      [  228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [  228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
      [  228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm]
      [  228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0
      [  228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120
      [  228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80
      [  228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70
      [  228.688109] Instruction dump:
      [  228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87
      [  228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87
      [  228.688251] irq event stamp: 205
      [  228.688287] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
      [  228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
      [  228.688412] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
      [  228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
      [  228.688513] ---[ end trace eb16f6260022a812 ]---
      [  228.688548] possible reason: unannotated irqs-off.
      [  228.688571] irq event stamp: 205
      [  228.688607] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
      [  228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
      [  228.688719] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
      [  228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
      
      Cc: stable@vger.kernel.org # v4.20+
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: default avatarCédric Le Goater <clg@kaod.org>
      Tested-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      1b28d553
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Fix page offset when clearing ESB pages · bcaa3110
      Cédric Le Goater authored
      Under XIVE, the ESB pages of an interrupt are used for interrupt
      management (EOI) and triggering. They are made available to guests
      through a mapping of the XIVE KVM device.
      
      When a device is passed-through, the passthru_irq helpers,
      kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped(), clear the ESB
      pages of the guest IRQ number being mapped and let the VM fault
      handler repopulate with the correct page.
      
      The ESB pages are mapped at offset 4 (KVM_XIVE_ESB_PAGE_OFFSET) in the
      KVM device mapping. Unfortunately, this offset was not taken into
      account when clearing the pages. This lead to issues with the
      passthrough devices for which the interrupts were not functional under
      some guest configuration (tg3 and single CPU) or in any configuration
      (e1000e adapter).
      Reviewed-by: default avatarGreg Kurz <groug@kaod.org>
      Tested-by: default avatarGreg Kurz <groug@kaod.org>
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      bcaa3110
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Take the srcu read lock when accessing memslots · aedb5b19
      Cédric Le Goater authored
      According to Documentation/virtual/kvm/locking.txt, the srcu read lock
      should be taken when accessing the memslots of the VM. The XIVE KVM
      device needs to do so when configuring the page of the OS event queue
      of vCPU for a given priority and when marking the same page dirty
      before migration.
      
      This avoids warnings such as :
      
      [  208.224882] =============================
      [  208.224884] WARNING: suspicious RCU usage
      [  208.224889] 5.2.0-rc2-xive+ #47 Not tainted
      [  208.224890] -----------------------------
      [  208.224894] ../include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage!
      [  208.224896]
                     other info that might help us debug this:
      
      [  208.224898]
                     rcu_scheduler_active = 2, debug_locks = 1
      [  208.224901] no locks held by qemu-system-ppc/3923.
      [  208.224902]
                     stack backtrace:
      [  208.224907] CPU: 64 PID: 3923 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc2-xive+ #47
      [  208.224909] Call Trace:
      [  208.224918] [c000200cdd98fa30] [c000000000be1934] dump_stack+0xe8/0x164 (unreliable)
      [  208.224924] [c000200cdd98fa80] [c0000000001aec80] lockdep_rcu_suspicious+0x110/0x180
      [  208.224935] [c000200cdd98fb00] [c0080000075933a0] gfn_to_memslot+0x1c8/0x200 [kvm]
      [  208.224943] [c000200cdd98fb40] [c008000007599600] gfn_to_pfn+0x28/0x60 [kvm]
      [  208.224951] [c000200cdd98fb70] [c008000007599658] gfn_to_page+0x20/0x40 [kvm]
      [  208.224959] [c000200cdd98fb90] [c0080000075b495c] kvmppc_xive_native_set_attr+0x8b4/0x1480 [kvm]
      [  208.224967] [c000200cdd98fca0] [c00800000759261c] kvm_device_ioctl_attr+0x64/0xb0 [kvm]
      [  208.224974] [c000200cdd98fcf0] [c008000007592730] kvm_device_ioctl+0xc8/0x110 [kvm]
      [  208.224979] [c000200cdd98fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0
      [  208.224981] [c000200cdd98fdb0] [c000000000434724] ksys_ioctl+0x104/0x120
      [  208.224984] [c000200cdd98fe00] [c000000000434768] sys_ioctl+0x28/0x80
      [  208.224988] [c000200cdd98fe20] [c00000000000b888] system_call+0x5c/0x70
      legoater@boss01:~$
      
      Fixes: 13ce3297 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
      Fixes: e6714bd1 ("KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages")
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      aedb5b19
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interrupts · ef974020
      Cédric Le Goater authored
      The passthrough interrupts are defined at the host level and their IRQ
      data should not be cleared unless specifically deconfigured (shutdown)
      by the host. They differ from the IPI interrupts which are allocated
      by the XIVE KVM device and reserved to the guest usage only.
      
      This fixes a host crash when destroying a VM in which a PCI adapter
      was passed-through. In this case, the interrupt is cleared and freed
      by the KVM device and then shutdown by vfio at the host level.
      
      [ 1007.360265] BUG: Kernel NULL pointer dereference at 0x00000d00
      [ 1007.360285] Faulting instruction address: 0xc00000000009da34
      [ 1007.360296] Oops: Kernel access of bad area, sig: 7 [#1]
      [ 1007.360303] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
      [ 1007.360314] Modules linked in: vhost_net vhost iptable_mangle ipt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_hv kvm xt_tcpudp iptable_filter squashfs fuse binfmt_misc vmx_crypto ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi nfsd ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath mlx5_ib ib_uverbs ib_core crc32c_vpmsum mlx5_core
      [ 1007.360425] CPU: 9 PID: 15576 Comm: CPU 18/KVM Kdump: loaded Not tainted 5.1.0-gad7e7d0ef #4
      [ 1007.360454] NIP:  c00000000009da34 LR: c00000000009e50c CTR: c00000000009e5d0
      [ 1007.360482] REGS: c000007f24ccf330 TRAP: 0300   Not tainted  (5.1.0-gad7e7d0ef)
      [ 1007.360500] MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002484  XER: 00000000
      [ 1007.360532] CFAR: c00000000009da10 DAR: 0000000000000d00 DSISR: 00080000 IRQMASK: 1
      [ 1007.360532] GPR00: c00000000009e62c c000007f24ccf5c0 c000000001510600 c000007fe7f947c0
      [ 1007.360532] GPR04: 0000000000000d00 0000000000000000 0000000000000000 c000005eff02d200
      [ 1007.360532] GPR08: 0000000000400000 0000000000000000 0000000000000000 fffffffffffffffd
      [ 1007.360532] GPR12: c00000000009e5d0 c000007fffff7b00 0000000000000031 000000012c345718
      [ 1007.360532] GPR16: 0000000000000000 0000000000000008 0000000000418004 0000000000040100
      [ 1007.360532] GPR20: 0000000000000000 0000000008430000 00000000003c0000 0000000000000027
      [ 1007.360532] GPR24: 00000000000000ff 0000000000000000 00000000000000ff c000007faa90d98c
      [ 1007.360532] GPR28: c000007faa90da40 00000000000fe040 ffffffffffffffff c000007fe7f947c0
      [ 1007.360689] NIP [c00000000009da34] xive_esb_read+0x34/0x120
      [ 1007.360706] LR [c00000000009e50c] xive_do_source_set_mask.part.0+0x2c/0x50
      [ 1007.360732] Call Trace:
      [ 1007.360738] [c000007f24ccf5c0] [c000000000a6383c] snooze_loop+0x15c/0x270 (unreliable)
      [ 1007.360775] [c000007f24ccf5f0] [c00000000009e62c] xive_irq_shutdown+0x5c/0xe0
      [ 1007.360795] [c000007f24ccf630] [c00000000019e4a0] irq_shutdown+0x60/0xe0
      [ 1007.360813] [c000007f24ccf660] [c000000000198c44] __free_irq+0x3a4/0x420
      [ 1007.360831] [c000007f24ccf700] [c000000000198dc8] free_irq+0x78/0xe0
      [ 1007.360849] [c000007f24ccf730] [c00000000096c5a8] vfio_msi_set_vector_signal+0xa8/0x350
      [ 1007.360878] [c000007f24ccf7f0] [c00000000096c938] vfio_msi_set_block+0xe8/0x1e0
      [ 1007.360899] [c000007f24ccf850] [c00000000096cae0] vfio_msi_disable+0xb0/0x110
      [ 1007.360912] [c000007f24ccf8a0] [c00000000096cd04] vfio_pci_set_msi_trigger+0x1c4/0x3d0
      [ 1007.360922] [c000007f24ccf910] [c00000000096d910] vfio_pci_set_irqs_ioctl+0xa0/0x170
      [ 1007.360941] [c000007f24ccf930] [c00000000096b400] vfio_pci_disable+0x80/0x5e0
      [ 1007.360963] [c000007f24ccfa10] [c00000000096b9bc] vfio_pci_release+0x5c/0x90
      [ 1007.360991] [c000007f24ccfa40] [c000000000963a9c] vfio_device_fops_release+0x3c/0x70
      [ 1007.361012] [c000007f24ccfa70] [c0000000003b5668] __fput+0xc8/0x2b0
      [ 1007.361040] [c000007f24ccfac0] [c0000000001409b0] task_work_run+0x140/0x1b0
      [ 1007.361059] [c000007f24ccfb20] [c000000000118f8c] do_exit+0x3ac/0xd00
      [ 1007.361076] [c000007f24ccfc00] [c0000000001199b0] do_group_exit+0x60/0x100
      [ 1007.361094] [c000007f24ccfc40] [c00000000012b514] get_signal+0x1a4/0x8f0
      [ 1007.361112] [c000007f24ccfd30] [c000000000021cc8] do_notify_resume+0x1a8/0x430
      [ 1007.361141] [c000007f24ccfe20] [c00000000000e444] ret_from_except_lite+0x70/0x74
      [ 1007.361159] Instruction dump:
      [ 1007.361175] 38422c00 e9230000 712a0004 41820010 548a2036 7d442378 78840020 71290020
      [ 1007.361194] 4082004c e9230010 7c892214 7c0004ac <e9240000> 0c090000 4c00012c 792a0022
      
      Cc: stable@vger.kernel.org # v4.12+
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarGreg Kurz <groug@kaod.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      ef974020
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Introduce a new mutex for the XIVE device · 7e10b9a6
      Cédric Le Goater authored
      The XICS-on-XIVE KVM device needs to allocate XIVE event queues when a
      priority is used by the OS. This is referred as EQ provisioning and it
      is done under the hood when :
      
        1. a CPU is hot-plugged in the VM
        2. the "set-xive" is called at VM startup
        3. sources are restored at VM restore
      
      The kvm->lock mutex is used to protect the different XIVE structures
      being modified but in some contexts, kvm->lock is taken under the
      vcpu->mutex which is not permitted by the KVM locking rules.
      
      Introduce a new mutex 'lock' for the KVM devices for them to
      synchronize accesses to the XIVE device structures.
      Reviewed-by: default avatarGreg Kurz <groug@kaod.org>
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      7e10b9a6