- 02 Nov, 2015 2 commits
-
-
Paolo Bonzini authored
Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD There's nothing much in the way of new features this time; it's mostly bug fixes, plus Nikunj has implemented support for KVM_CAP_NR_MEMSLOTS.
-
Paolo Bonzini authored
Merge tag 'kvm-s390-next-20151028' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: s390: Bugfix and cleanups There is one important bug fix for a potential memory corruption and/or guest errors for guests with 63 or 64 vCPUs. This fix would qualify for 4.3 but is some days too late giving that we are about to release 4.3. Given that this patch is cc stable >= 3.15 anyway, we can handle it via 4.4. merge window. This pull request also contains two cleanups.
-
- 29 Oct, 2015 3 commits
-
-
Christian Borntraeger authored
We currently do some magic shifting (by exploiting that exit codes are always a multiple of 4) and a table lookup to jump into the exit handlers. This causes some calculations and checks, just to do an potentially expensive function call. Changing that to a switch statement gives the compiler the chance to inline and dynamically decide between jump tables or inline compare and branches. In addition it makes the code more readable. bloat-o-meter gives me a small reduction in code size: add/remove: 0/7 grow/shrink: 1/1 up/down: 986/-1334 (-348) function old new delta kvm_handle_sie_intercept 72 1058 +986 handle_prog 704 696 -8 handle_noop 54 - -54 handle_partial_execution 60 - -60 intercept_funcs 120 - -120 handle_instruction 198 - -198 handle_validity 210 - -210 handle_stop 316 - -316 handle_external_interrupt 368 - -368 Right now my gcc does conditional branches instead of jump tables. The inlining seems to give us enough cycles as some micro-benchmarking shows minimal improvements, but still in noise. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
-
Christian Borntraeger authored
the s390 debug feature does not need newlines. In fact it will result in empty lines. Get rid of 4 leftovers. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
-
David Hildenbrand authored
We seemed to have missed a few corner cases in commit f6c137ff ("KVM: s390: randomize sca address"). The SCA has a maximum size of 2112 bytes. By setting the sca_offset to some unlucky numbers, we exceed the page. 0x7c0 (1984) -> Fits exactly 0x7d0 (2000) -> 16 bytes out 0x7e0 (2016) -> 32 bytes out 0x7f0 (2032) -> 48 bytes out One VCPU entry is 32 bytes long. For the last two cases, we actually write data to the other page. 1. The address of the VCPU. 2. Injection/delivery/clearing of SIGP externall calls via SIGP IF. Especially the 2. happens regularly. So this could produce two problems: 1. The guest losing/getting external calls. 2. Random memory overwrites in the host. So this problem happens on every 127 + 128 created VM with 64 VCPUs. Cc: stable@vger.kernel.org # v3.15+ Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
- 21 Oct, 2015 4 commits
-
-
Gautham R. Shenoy authored
Currently a CPU running a guest can receive a H_DOORBELL in the following two cases: 1) When the CPU is napping due to CEDE or there not being a guest vcpu. 2) The CPU is running the guest vcpu. Case 1), the doorbell message is not cleared since we were waking up from nap. Hence when the EE bit gets set on transition from guest to host, the H_DOORBELL interrupt is delivered to the host and the corresponding handler is invoked. However in Case 2), the message gets cleared by the action of taking the H_DOORBELL interrupt. Since the CPU was running a guest, instead of invoking the doorbell handler, the code invokes the second-level interrupt handler to switch the context from the guest to the host. At this point the setting of the EE bit doesn't result in the CPU getting the doorbell interrupt since it has already been delivered once. So, the handler for this doorbell is never invoked! This causes softlockups if the missed DOORBELL was an IPI sent from a sibling subcore on the same CPU. This patch fixes it by explitly invoking the doorbell handler on the exit path if the exit reason is H_DOORBELL similar to the way an EXTERNAL interrupt is handled. Since this will also handle Case 1), we can unconditionally clear the doorbell message in kvmppc_check_wake_reason. Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
-
Nikunj A Dadhania authored
QEMU assumes 32 memslots if this extension is not implemented. Although, current value of KVM_USER_MEM_SLOTS is 32, once KVM_USER_MEM_SLOTS changes QEMU would take a wrong value. Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
-
Paul Mackerras authored
This fixes a bug where the old HPTE value returned by H_REMOVE has the valid bit clear if the HPTE was an absent HPTE, as happens for HPTEs for emulated MMIO pages and for RAM pages that have been paged out by the host. If the absent bit is set, we clear it and set the valid bit, because from the guest's point of view, the HPTE is valid. Signed-off-by: Paul Mackerras <paulus@samba.org>
-
Paul Mackerras authored
Currently the KVM_PPC_ALLOCATE_HTAB will try to allocate the requested size of HPT, and if that is not possible, then try to allocate smaller sizes (by factors of 2) until either a minimum is reached or the allocation succeeds. This is not ideal for userspace, particularly in migration scenarios, where the destination VM really does require the size requested. Also, the minimum HPT size of 256kB may be insufficient for the guest to run successfully. This removes the fallback to smaller sizes on allocation failure for the KVM_PPC_ALLOCATE_HTAB ioctl. The fallback still exists for the case where the HPT is allocated at the time the first VCPU is run, if no HPT has been allocated by ioctl by that time. Signed-off-by: Paul Mackerras <paulus@samba.org>
-
- 19 Oct, 2015 2 commits
-
-
Takuya Yoshikawa authored
Commit fd136902 ("KVM: x86: MMU: Move mapping_level_dirty_bitmap() call in mapping_level()") forgot to initialize force_pt_level to false in FNAME(page_fault)() before calling mapping_level() like nonpaging_map() does. This can sometimes result in forcing page table level mapping unnecessarily. Fix this and move the first *force_pt_level check in mapping_level() before kvm_vcpu_gfn_to_memslot() call to make it a bit clearer that the variable must be initialized before mapping_level() gets called. This change can also avoid calling kvm_vcpu_gfn_to_memslot() when !check_hugepage_cache_consistency() check in tdp_page_fault() forces page table level mapping. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Not zeroing EFER means that a 32-bit firmware cannot enter paging mode without clearing EFER.LME first (which it should not know about). Yang Zhang from Intel confirmed that the manual is wrong and EFER is cleared to zero on INIT. Fixes: d28bc9dd Cc: stable@vger.kernel.org Cc: Yang Z Zhang <yang.z.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 16 Oct, 2015 16 commits
-
-
Andrey Smetanin authored
Any other irq routing types (MSI, S390_ADAPTER, upcoming Hyper-V SynIC) map one-to-one to GSI. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Andrey Smetanin authored
Allow for arch-specific interrupt types to be set. For that, add kvm_arch_set_irq() which takes interrupt type-specific action if it recognizes the interrupt type given, and -EWOULDBLOCK otherwise. The default implementation always returns -EWOULDBLOCK. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Andrey Smetanin authored
Factor out kvm_notify_acked_gsi() helper to iterate over EOI listeners and notify those matching the given gsi. It will be reused in the upcoming Hyper-V SynIC implementation. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Andrey Smetanin authored
The loop(for) inside irqfd_update() is unnecessary because any other value for irq_entry.type will just trigger schedule_work(&irqfd->inject) in irqfd_wakeup. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Marcelo Tosatti authored
As reported at https://bugs.launchpad.net/qemu/+bug/1494350, it is possible to have vcpu->arch.st.last_steal initialized from a thread other than vcpu thread, say the iothread, via KVM_SET_MSRS. Which can cause an overflow later (when subtracting from vcpu threads sched_info.run_delay). To avoid that, move steal time accumulation to vcpu entry time, before copying steal time data to guest. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Takuya Yoshikawa authored
Calling kvm_vcpu_gfn_to_memslot() twice in mapping_level() should be avoided since getting a slot by binary search may not be negligible, especially for virtual machines with many memory slots. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Takuya Yoshikawa authored
Now that it has only one caller, and its name is not so helpful for readers, remove it. The new memslot_valid_for_gpte() function makes it possible to share the common code between gfn_to_memslot_dirty_bitmap() and mapping_level(). Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Takuya Yoshikawa authored
This is necessary to eliminate an extra memory slot search later. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Takuya Yoshikawa authored
As a bonus, an extra memory slot search can be eliminated when is_self_change_mapping is true. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Takuya Yoshikawa authored
This will be passed to a function later. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Joerg Roedel authored
Currently we always write the next_rip of the shadow vmcb to the guests vmcb when we emulate a vmexit. This could confuse the guest when its cpuid indicated no support for the next_rip feature. Fix this by only propagating next_rip if the guest actually supports it. Cc: Bandan Das <bsd@redhat.com> Cc: Dirk Mueller <dmueller@suse.com> Tested-By: Dirk Mueller <dmueller@suse.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The loop is computing one of two constants, it can be simpler to write everything inline. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
Expose VPID capability to L1. For nested guests, we don't do anything specific for single context invalidation. Hence, only advertise support for global context invalidation. The major benefit of nested VPID comes from having separate vpids when switching between L1 and L2, and also when L2's vCPUs not sched in/out on L1. Reviewed-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
VPID is used to tag address space and avoid a TLB flush. Currently L0 use the same VPID to run L1 and all its guests. KVM flushes VPID when switching between L1 and L2. This patch advertises VPID to the L1 hypervisor, then address space of L1 and L2 can be separately treated and avoid TLB flush when swithing between L1 and L2. For each nested vmentry, if vpid12 is changed, reuse shadow vpid w/ an invvpid. Performance: run lmbench on L2 w/ 3.5 kernel. Context switching - times in microseconds - smaller is better ------------------------------------------------------------------------- Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw --------- ------------- ------ ------ ------ ------ ------ ------- ------- kernel Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.60000 2.88000 nested VPID kernel Linux 3.5.0-1 1.2600 1.4300 1.5600 12.7 12.9 3.49000 7.46000 vanilla Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
Add the INVVPID instruction emulation. Reviewed-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Mahesh Salgaonkar authored
For the machine check interrupt that happens while we are in the guest, kvm layer attempts the recovery, and then delivers the machine check interrupt directly to the guest if recovery fails. On successful recovery we go back to normal functioning of the guest. But there can be cases where a machine check interrupt can happen with MSR(RI=0) while we are in the guest. This means MC interrupt is unrecoverable and we have to deliver a machine check to the guest since the machine check interrupt might have trashed valid values in SRR0/1. The current implementation do not handle this case, causing guest to crash with Bad kernel stack pointer instead of machine check oops message. [26281.490060] Bad kernel stack pointer 3fff9ccce5b0 at c00000000000490c [26281.490434] Oops: Bad kernel stack pointer, sig: 6 [#1] [26281.490472] SMP NR_CPUS=2048 NUMA pSeries This patch fixes this issue by checking MSR(RI=0) in KVM layer and forwarding unrecoverable interrupt to guest which then panics with proper machine check Oops message. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
-
- 15 Oct, 2015 4 commits
-
-
Tudor Laurentiu authored
Fix couple of cases where we shift left a 32-bit value thus might get truncated results on 64-bit targets. Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com> Suggested-by: Scott Wood <scotttwood@freescale.com> Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
-
Tudor Laurentiu authored
Emulate TMCFG0 TMRN register exposing one HW thread per vcpu. Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> [Laurentiu.Tudor@freescale.com: rebased on latest kernel, use define instead of hardcoded value, moved code in own function] Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com> Acked-by: Scott Wood <scotttwood@freescale.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
-
Andrzej Hajda authored
The function can return negative value. The problem has been detected using proposed semantic patch scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1]. [1]: http://permalink.gmane.org/gmane.linux.kernel/2046107Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
-
Tudor Laurentiu authored
The register is not currently used in the base kernel but will be in a forthcoming kvm patch. Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com> Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
-
- 14 Oct, 2015 9 commits
-
-
Wanpeng Li authored
Introduce __vmx_flush_tlb() to handle specific vpid. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
Adjust allocate/free_vid so that they can be reused for the nested vpid. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Kosuke Tatsukawa authored
async_pf_execute() seems to be missing a memory barrier which might cause the waker to not notice the waiter and miss sending a wake_up as in the following figure. async_pf_execute kvm_vcpu_block ------------------------------------------------------------------------ spin_lock(&vcpu->async_pf.lock); if (waitqueue_active(&vcpu->wq)) /* The CPU might reorder the test for the waitqueue up here, before prior writes complete */ prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); /*if (kvm_vcpu_check_block(vcpu) < 0) */ /*if (kvm_arch_vcpu_runnable(vcpu)) { */ ... return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE && !vcpu->arch.apf.halted) || !list_empty_careful(&vcpu->async_pf.done) ... return 0; list_add_tail(&apf->link, &vcpu->async_pf.done); spin_unlock(&vcpu->async_pf.lock); waited = true; schedule(); ------------------------------------------------------------------------ The attached patch adds the missing memory barrier. I found this issue when I was looking through the linux source code for places calling waitqueue_active() before wake_up*(), but without preceding memory barriers, after sending a patch to fix a similar issue in drivers/tty/n_tty.c (Details about the original issue can be found here: https://lkml.org/lkml/2015/9/28/849). Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Radim Krčmář authored
On real hardware, edge-triggered interrupts don't set a bit in TMR, which means that IOAPIC isn't notified on EOI. Do the same here. Staying in guest/kernel mode after edge EOI is what we want for most devices. If some bugs could be nicely worked around with edge EOI notifications, we should invest in a better interface. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Radim Krčmář authored
KVM uses eoi_exit_bitmap to track vectors that need an action on EOI. The problem is that IOAPIC can be reconfigured while an interrupt with old configuration is pending and eoi_exit_bitmap only remembers the newest configuration; thus EOI from the pending interrupt is not recognized. (Reconfiguration is not a problem for level interrupts, because IOAPIC sends interrupt with the new configuration.) For an edge interrupt with ACK notifiers, like i8254 timer; things can happen in this order 1) IOAPIC inject a vector from i8254 2) guest reconfigures that vector's VCPU and therefore eoi_exit_bitmap on original VCPU gets cleared 3) guest's handler for the vector does EOI 4) KVM's EOI handler doesn't pass that vector to IOAPIC because it is not in that VCPU's eoi_exit_bitmap 5) i8254 stops working A simple solution is to set the IOAPIC vector in eoi_exit_bitmap if the vector is in PIR/IRR/ISR. This creates an unwanted situation if the vector is reused by a non-IOAPIC source, but I think it is so rare that we don't want to make the solution more sophisticated. The simple solution also doesn't work if we are reconfiguring the vector. (Shouldn't happen in the wild and I'd rather fix users of ACK notifiers instead of working around that.) The are no races because ioapic injection and reconfig are locked. Fixes: b053b2ae ("KVM: x86: Add EOI exit bitmap inference") [Before b053b2ae, this bug happened only with APICv.] Fixes: c7c9c56c ("x86, apicv: add virtual interrupt delivery support") Cc: <stable@vger.kernel.org> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Radim Krčmář authored
After moving PIR to IRR, the interrupt needs to be delivered manually. Reported-by: Paolo Bonzini <pbonzini@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Merge more important SMM fixes.
-
Paolo Bonzini authored
In order to get into 64-bit protected mode, you need to enable paging while EFER.LMA=1. For this to work, CS.L must be 0. Currently, we load the segments before CR0 and CR4, which means that if RSM returns into 64-bit protected mode CS.L is already 1 and everything breaks. Luckily, CS.L=0 is always the case when executing RSM, because it is forbidden to execute RSM from 64-bit protected mode. Hence it is enough to load CR0 and CR4 first, and only then the segments. Fixes: 660a5d51 Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Unfortunately I only noticed this after pushing. Fixes: f0d648bd Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-