1. 03 Jul, 2017 2 commits
    • Merge branch 'kvm-ppc-next' of... · 8a53e7e5
      Paolo Bonzini authored
      Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
      
      - Better machine check handling for HV KVM
      - Ability to support guests with threads=2, 4 or 8 on POWER9
      - Fix for a race that could cause delayed recognition of signals
      - Fix for a bug where POWER9 guests could sleep with interrupts
        pending.
    • KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code · 00c14757
      Paul Mackerras authored
      This fixes a typo where the wrong loop index was used to index
      the kvmppc_xive_vcpu.queues[] array in xive_pre_save_scan().
      The variable i contains the vcpu number; we need to index queues[]
      using j, which iterates from 0 to KVMPPC_XIVE_Q_COUNT-1.
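
      For illustration, a minimal C sketch of the corrected nesting;
      everything except queues[] and KVMPPC_XIVE_Q_COUNT is a simplified
      stand-in for the real kernel structures:

      #define KVMPPC_XIVE_Q_COUNT 8

      struct xive_q { int pending; };        /* simplified queue state */
      struct xive_vcpu { struct xive_q queues[KVMPPC_XIVE_Q_COUNT]; };

      static void pre_save_queue(struct xive_q *q) { (void)q; /* save one queue */ }

      static void pre_save_scan(struct xive_vcpu **vcpus, int nr_vcpus)
      {
              int i, j;

              for (i = 0; i < nr_vcpus; i++)          /* i is the vcpu number */
                      for (j = 0; j < KVMPPC_XIVE_Q_COUNT; j++)
                              /* index with j: i can be >= KVMPPC_XIVE_Q_COUNT */
                              pre_save_queue(&vcpus[i]->queues[j]);
      }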
      
      The effect of this bug is that operations which save the interrupt
      controller state, such as "virsh dump", on a VM with more than
      8 vCPUs cause xive_pre_save_queue() to be called on a bogus queue
      structure, usually resulting in a crash like this:
      
      [  501.821107] Unable to handle kernel paging request for data at address 0x00000084
      [  501.821212] Faulting instruction address: 0xc008000004c7c6f8
      [  501.821234] Oops: Kernel access of bad area, sig: 11 [#1]
      [  501.821305] SMP NR_CPUS=1024
      [  501.821307] NUMA
      [  501.821376] PowerNV
      [  501.821470] Modules linked in: vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel kvm_hv nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm tg3 ptp pps_core
      [  501.822477] CPU: 3 PID: 3934 Comm: live_migration Not tainted 4.11.0-4.git8caa70f.el7.centos.ppc64le #1
      [  501.822633] task: c0000003f9e3ae80 task.stack: c0000003f9ed4000
      [  501.822745] NIP: c008000004c7c6f8 LR: c008000004c7c628 CTR: 0000000030058018
      [  501.822877] REGS: c0000003f9ed7980 TRAP: 0300   Not tainted  (4.11.0-4.git8caa70f.el7.centos.ppc64le)
      [  501.823030] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
      [  501.823047]   CR: 28022244  XER: 00000000
      [  501.823203] CFAR: c008000004c7c77c DAR: 0000000000000084 DSISR: 40000000 SOFTE: 1
      [  501.823203] GPR00: c008000004c7c628 c0000003f9ed7c00 c008000004c91450 00000000000000ff
      [  501.823203] GPR04: c0000003f5580000 c0000003f559bf98 9000000000009033 0000000000000000
      [  501.823203] GPR08: 0000000000000084 0000000000000000 00000000000001e0 9000000000001003
      [  501.823203] GPR12: c00000000008a7d0 c00000000fdc1b00 000000000a9a0000 0000000000000000
      [  501.823203] GPR16: 00000000402954e8 000000000a9a0000 0000000000000004 0000000000000000
      [  501.823203] GPR20: 0000000000000008 c000000002e8f180 c000000002e8f1e0 0000000000000001
      [  501.823203] GPR24: 0000000000000008 c0000003f5580008 c0000003f4564018 c000000002e8f1e8
      [  501.823203] GPR28: 00003ff6e58bdc28 c0000003f4564000 0000000000000000 0000000000000000
      [  501.825441] NIP [c008000004c7c6f8] xive_get_attr+0x3b8/0x5b0 [kvm]
      [  501.825671] LR [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm]
      [  501.825887] Call Trace:
      [  501.825991] [c0000003f9ed7c00] [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm] (unreliable)
      [  501.826312] [c0000003f9ed7cd0] [c008000004c62ec4] kvm_device_ioctl_attr+0x64/0xa0 [kvm]
      [  501.826581] [c0000003f9ed7d20] [c008000004c62fcc] kvm_device_ioctl+0xcc/0xf0 [kvm]
      [  501.826843] [c0000003f9ed7d40] [c000000000350c70] do_vfs_ioctl+0xd0/0x8c0
      [  501.827060] [c0000003f9ed7de0] [c000000000351534] SyS_ioctl+0xd4/0xf0
      [  501.827282] [c0000003f9ed7e30] [c00000000000b8e0] system_call+0x38/0xfc
      [  501.827496] Instruction dump:
      [  501.827632] 419e0078 3b760008 e9160008 83fb000c 83db0010 80fb0008 2f280000 60000000
      [  501.827901] 60000000 60420000 419a0050 7be91764 <7d284c2c> 552a0ffe 7f8af040 419e003c
      [  501.828176] ---[ end trace 2d0529a5bbbbafed ]---
      
      Cc: stable@vger.kernel.org
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  2. 01 Jul, 2017 2 commits
    • KVM: PPC: Book3S HV: Close race with testing for signals on guest entry · 8b24e69f
      Paul Mackerras authored
      At present, interrupts are hard-disabled fairly late in the guest
      entry path, in the assembly code.  Since we check for pending signals
      for the vCPU tasks earlier in the guest entry path, it is possible
      for a signal to be delivered before we enter the guest but not be
      noticed until after we exit the guest for some other reason.
      
      Similarly, it is possible for the scheduler to request a reschedule
      while we are in the guest entry path, and we won't notice until after
      we have run the guest, potentially for a whole timeslice.
      
      Furthermore, with a radix guest on POWER9, we can take the interrupt
      with the MMU on.  In this case we end up leaving interrupts
      hard-disabled after the guest exit, and they are likely to stay
      hard-disabled until we exit to userspace or context-switch to
      another process.  This was masking the fact that we were also not
      setting the RI (recoverable interrupt) bit in the MSR, meaning
      that if we had taken an interrupt, it would have crashed the host
      kernel with an unrecoverable interrupt message.
      
      To close these races, we need to check for signals and reschedule
      requests after hard-disabling interrupts, and then keep interrupts
      hard-disabled until we enter the guest.  If there is a signal or a
      reschedule request from another CPU, it will send an IPI, which will
      cause a guest exit.
      
      This puts the interrupt disabling before we call kvmppc_start_thread()
      for all the secondary threads of this core that are going to run vCPUs.
      The reason for that is that once we have started the secondary threads
      there is no easy way to back out without going through at least part
      of the guest entry path.  However, kvmppc_start_thread() includes some
      code for radix guests which needs to call smp_call_function(), which
      must be called with interrupts enabled.  To solve this problem, this
      patch moves that code into a separate function that is called earlier.
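
      As a rough, hedged sketch of the resulting ordering (the helper
      names below are illustrative stand-ins, not the real
      kvmppc_run_core() symbols):

      #include <stdbool.h>

      static void radix_prepare(void)             { } /* needs smp_call_function(): IRQs on */
      static void hard_irq_disable_stub(void)     { }
      static void hard_irq_enable_stub(void)      { }
      static bool signal_or_resched_pending(void) { return false; }
      static void start_secondary_threads(void)   { }
      static int  run_guest(void)                 { return 0; /* trap that caused the exit */ }

      static int enter_guest_on_core(void)
      {
              radix_prepare();                 /* moved earlier: must run with IRQs enabled */
              hard_irq_disable_stub();         /* IRQs stay hard-disabled until guest entry */
              if (signal_or_resched_pending()) {
                      hard_irq_enable_stub();
                      return -1;               /* bail out before starting secondary threads */
              }
              start_secondary_threads();       /* hard to back out past this point */
              return run_guest();
      }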
      
      When the guest exit is caused by an external interrupt, a hypervisor
      doorbell or a hypervisor maintenance interrupt, we now handle these
      using the replay facility.  __kvmppc_vcore_entry() now returns the
      trap number that caused the exit on this thread, and instead of the
      assembly code jumping to the handler entry, we return to C code with
      interrupts still hard-disabled and set the irq_happened flag in the
      PACA, so that when we do local_irq_enable() the appropriate handler
      gets called.
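
      The replay step, reduced to a hedged stand-alone sketch; the flag
      bits here only model the kernel's paca->irq_happened / PACA_IRQ_*
      mechanism, they are not copied from it:

      #define IRQ_HAPPENED_EE    0x01u   /* models PACA_IRQ_EE */
      #define IRQ_HAPPENED_DBELL 0x02u   /* models PACA_IRQ_DBELL */
      #define IRQ_HAPPENED_HMI   0x04u   /* models PACA_IRQ_HMI */

      enum exit_trap { TRAP_EXTERNAL, TRAP_H_DOORBELL, TRAP_HMI, TRAP_OTHER };

      static unsigned int irq_happened;  /* models local_paca->irq_happened */

      /* Called in C code, with IRQs still hard-disabled, after guest exit. */
      static void note_exit_trap(enum exit_trap trap)
      {
              switch (trap) {
              case TRAP_EXTERNAL:   irq_happened |= IRQ_HAPPENED_EE;    break;
              case TRAP_H_DOORBELL: irq_happened |= IRQ_HAPPENED_DBELL; break;
              case TRAP_HMI:        irq_happened |= IRQ_HAPPENED_HMI;   break;
              default:              break;
              }
              /* A later local_irq_enable() sees the flag and replays the
               * corresponding host interrupt handler. */
      }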
      
      With all this, we now have the interrupt soft-enable flag clear while
      we are in the guest.  This is useful because code in the real-mode
      hypercall handlers that checks whether interrupts are enabled will
      now see that they are disabled, which is correct, since interrupts
      are hard-disabled in the real-mode code.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Simplify dynamic micro-threading code · 898b25b2
      Paul Mackerras authored
      Since commit b009031f ("KVM: PPC: Book3S HV: Take out virtual
      core piggybacking code", 2016-09-15), we only have at most one
      vcore per subcore.  Previously, the fact that there might be more
      than one vcore per subcore meant that we had the notion of a
      "master vcore", which was the vcore that controlled thread 0 of
      the subcore.  We also needed a list per subcore in the core_info
      struct to record which vcores belonged to each subcore.  Now that
      there can only be one vcore in the subcore, we can replace the
      list with a simple pointer and get rid of the notion of the
      master vcore (and in fact treat every vcore as a master vcore).
      
      We can also get rid of the subcore_vm[] field in the core_info
      struct since it is never read.
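
      Schematically, with field names simplified from the book3s_hv.c
      structures, the core_info bookkeeping shrinks to one pointer per
      subcore:

      #define MAX_SUBCORES 4

      struct kvmppc_vcore;                    /* opaque here: one virtual core */

      /* Sketch only: the per-subcore vcore list and subcore_vm[] are gone;
       * at most one vcore per subcore means a plain pointer suffices, and
       * every vcore is effectively its own master. */
      struct core_info {
              int n_subcores;
              struct kvmppc_vcore *vc[MAX_SUBCORES];
      };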
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  3. 30 Jun, 2017 2 commits
  4. 29 Jun, 2017 3 commits
    • KVM: LAPIC: Fix lapic timer injection delay · c8533544
      Wanpeng Li authored
      If the TSC deadline timer is programmed very close to the deadline, or
      even in the past, the computation in vmx_set_hv_timer will program the
      absolute target TSC value into the VMCS preemption timer field with
      delta == 0.  This causes a vmentry that is immediately followed by a
      VMX preemption timer vmexit, and the LAPIC timer injection is delayed
      by that round trip.  The LAPIC timer emulated by an hrtimer, by
      contrast, handles this case correctly.
      
      This patch fixes it by firing the LAPIC timer and injecting the timer
      interrupt immediately on the next vmentry if the TSC deadline timer is
      programmed very close to the deadline or already in the past.  This
      saves ~300 cycles on the tsc_deadline_timer test of apic.flat.
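
      A hedged sketch of the decision, not the actual lapic.c code: if
      the deadline has effectively already passed, inject now instead of
      arming the preemption timer only to take an immediate vmexit.

      #include <stdbool.h>
      #include <stdint.h>

      static void inject_lapic_timer_irq(void) { }    /* delivered on the next vmentry */

      /* Returns true if the hv (preemption) timer was armed. */
      static bool arm_or_fire(uint64_t target_tsc, uint64_t now_tsc)
      {
              if (target_tsc <= now_tsc) {            /* deadline in the past, delta == 0 */
                      inject_lapic_timer_irq();       /* fire immediately */
                      return false;
              }
              /* otherwise program the preemption timer with target_tsc - now_tsc */
              return true;
      }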
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: lapic: reorganize restart_apic_timer · a749e247
      Paolo Bonzini authored
      Move the code to cancel the hv timer into the caller, just before
      it starts the hrtimer.  Check availability of the hv timer in
      start_hv_timer.
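
      A hedged sketch of the reorganized flow; the names mirror the
      functions mentioned above, but the bodies are illustrative stubs
      only:

      #include <stdbool.h>

      static bool start_hv_timer(void)  { return false; /* now checks availability itself */ }
      static void cancel_hv_timer(void) { }
      static void start_sw_timer(void)  { }              /* the hrtimer fallback */

      static void restart_apic_timer(void)
      {
              if (start_hv_timer())
                      return;
              cancel_hv_timer();      /* cancel in the caller, just before ... */
              start_sw_timer();       /* ... starting the hrtimer */
      }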
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: lapic: reorganize start_hv_timer · 35ee9e48
      Paolo Bonzini authored
      There are many cases in which the hv timer must be canceled.  Split out
      a new function to avoid duplication.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 28 Jun, 2017 5 commits
  6. 27 Jun, 2017 12 commits
  7. 22 Jun, 2017 9 commits
  8. 21 Jun, 2017 2 commits
  9. 20 Jun, 2017 1 commit
    • KVM: PPC: Book3S HV: Don't sleep if XIVE interrupt pending on POWER9 · ee3308a2
      Paul Mackerras authored
      On a POWER9 system, it is possible for an interrupt to become pending
      for a VCPU when that VCPU is about to cede (execute an H_CEDE hypercall)
      and has already disabled interrupts, or is in the H_CEDE processing up
      to the point where the XIVE context is pulled from the hardware.  In
      such a case, the H_CEDE should not sleep, but should return immediately
      to the guest.  However, the conditions tested in kvmppc_vcpu_woken()
      don't include the condition that a XIVE interrupt is pending, so the
      VCPU could sleep until the next decrementer interrupt.
      
      To fix this, we add a new xive_interrupt_pending() helper which looks
      in the XIVE context that was pulled from the hardware to see if the
      priority of any pending interrupt is higher than (i.e. numerically
      lower than) the CPU priority.  If so, kvmppc_vcpu_woken() will return true.
      If the XIVE context has never been used, then both the pipr and the
      cppr fields will be zero and the test will indicate that no interrupt
      is pending.
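
      A hedged sketch of the test (XIVE priorities are "lower number
      wins", and a never-used context has pipr == cppr == 0, which
      correctly reads as nothing pending):

      #include <stdbool.h>
      #include <stdint.h>

      struct xive_saved_state { uint8_t pipr, cppr; };   /* context pulled from HW */

      static bool xive_irq_pending(const struct xive_saved_state *s)
      {
              /* pending priority numerically below the CPU priority => deliverable */
              return s->pipr < s->cppr;
      }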
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  10. 19 Jun, 2017 2 commits
    • KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 · 57900694
      Paul Mackerras authored
      On POWER9, we no longer have the restriction that we had on POWER8
      where all threads in a core have to be in the same partition, so
      the CPU threads are now independent.  However, we still want to be
      able to run guests with a virtual SMT topology, if only to allow
      migration of guests from POWER8 systems to POWER9.
      
      A guest that has a virtual SMT mode greater than 1 will expect to
      be able to use the doorbell facility; it will expect the msgsndp
      and msgclrp instructions to work appropriately and to be able to read
      sensible values from the TIR (thread identification register) and
      DPDES (directed privileged doorbell exception status) special-purpose
      registers.  However, since each CPU thread is a separate sub-processor
      in POWER9, these instructions and registers can only be used within
      a single CPU thread.
      
      In order for these instructions to appear to act correctly according
      to the guest's virtual SMT mode, we have to trap and emulate them.
      We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
      register.  The emulation is triggered by the hypervisor facility
      unavailable interrupt that occurs when the guest uses them.
      
      To cause a doorbell interrupt to occur within the guest, we set the
      DPDES register to 1.  If the guest has interrupts enabled, the CPU
      will generate a doorbell interrupt and clear the DPDES register in
      hardware.  The DPDES hardware register for the guest is saved in the
      vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
      exit code, other VCPUs wishing to cause a doorbell interrupt don't
      write that field directly, but instead set a vcpu->arch.doorbell_request
      flag.  This is consumed and set to 0 by the guest entry code, which
      then sets DPDES to 1.
      
      Emulating reads of the DPDES register is somewhat involved, because
      it requires reading the doorbell pending interrupt status of all of the
      VCPU threads in the virtual core, and if any of those VCPUs are
      running, their doorbell status is only up-to-date in the hardware
      DPDES registers of the CPUs where they are running.  In order to get
      a reasonable approximation of the current doorbell status, we send
      those CPUs an IPI, causing an exit from the guest which will update
      the vcpu->arch.vcore->dpdes field.  We then use that value in
      constructing the emulated DPDES register value.
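
      A hedged sketch of both halves, with the structures and helpers
      heavily simplified: another vCPU only sets doorbell_request, the
      guest entry path turns that into DPDES = 1, and an emulated DPDES
      read kicks the running vCPUs so their saved state is reasonably
      fresh before the per-thread bits are assembled.

      #include <stdbool.h>
      #include <stdint.h>

      struct vthread {
              bool running;
              bool doorbell_request;   /* set by other vCPUs wanting a doorbell */
              bool dpdes_saved;        /* DPDES bit saved by the guest exit code */
      };

      static void send_ipi(struct vthread *t) { (void)t; /* forces a guest exit */ }
      static void set_hw_dpdes(uint64_t v)    { (void)v; /* write the DPDES SPR  */ }

      /* Guest entry: consume the request and raise the hardware DPDES. */
      static void consume_doorbell_request(struct vthread *t)
      {
              if (t->doorbell_request) {
                      t->doorbell_request = false;
                      set_hw_dpdes(1);
              }
      }

      /* Emulated DPDES read: refresh running threads, then gather the bits
       * (an approximation, as described above). */
      static uint64_t emulate_dpdes_read(struct vthread *threads, int nthreads)
      {
              uint64_t dpdes = 0;
              int t;

              for (t = 0; t < nthreads; t++) {
                      if (threads[t].running)
                              send_ipi(&threads[t]);   /* exit path updates dpdes_saved */
                      if (threads[t].doorbell_request || threads[t].dpdes_saved)
                              dpdes |= 1ull << t;
              }
              return dpdes;
      }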
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode · 3c313524
      Paul Mackerras authored
      This allows userspace to set the desired virtual SMT (simultaneous
      multithreading) mode for a VM, that is, the number of VCPUs that
      get assigned to each virtual core.  Previously, the virtual SMT mode
      was fixed to the number of threads per subcore, and if userspace
      wanted to have fewer vcpus per vcore, then it would achieve that by
      using a sparse CPU numbering.  This had the disadvantage that the
      vcpu numbers could get quite large, particularly for SMT1 guests on
      a POWER8 with 8 threads per core.  With this patch, userspace can
      set its desired virtual SMT mode and then use contiguous vcpu
      numbering.
      
      On POWER8, where the threading mode is "strict", the virtual SMT mode
      must be less than or equal to the number of threads per subcore.  On
      POWER9, which implements a "loose" threading mode, the virtual SMT
      mode can be any power of 2 between 1 and 8, even though there is
      effectively one thread per subcore, since the threads are independent
      and can all be in different partitions.
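
      A hedged stand-alone sketch of the validation rule described above
      (the in-kernel check is part of KVM's handling of the SMT-mode
      capability; the function below is only illustrative):

      #include <stdbool.h>

      /* mode: requested virtual SMT mode; strict: POWER8-style threading */
      static bool smt_mode_ok(unsigned int mode, unsigned int threads_per_subcore,
                              bool strict)
      {
              if (mode < 1 || mode > 8)
                      return false;
              if (mode & (mode - 1))                  /* must be a power of 2 */
                      return false;
              if (strict && mode > threads_per_subcore)
                      return false;
              return true;
      }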
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>