- 31 Jan, 2018 1 commit
-
-
Cornelia Huck authored
Acked-by: Janosch Frank <frankja@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
-
- 30 Jan, 2018 1 commit
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linuxRadim Krčmář authored
KVM: s390: Fixes and features for 4.16 part 2 - exitless interrupts for emulated devices (Michael Mueller) - cleanup of cpuflag handling (David Hildenbrand) - kvm stat counter improvements (Christian Borntraeger) - vsie improvements (David Hildenbrand) - mm cleanup (Janosch Frank)
-
- 26 Jan, 2018 12 commits
-
-
Michael Mueller authored
The patch modifies the previously defined GISA data structure to be able to store two GISA formats, format-0 and format-1. Additionally, it verifies the availability of the GISA format facility and enables the use of a format-1 GISA in the SIE control block accordingly. A format-1 can do everything that format-0 can and we will need it for real HW passthrough. As there are systems with only format-0 we keep both variants. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
The GISA format facility is required by the host to be able to process a format-1 GISA. If not available, the used GISA format will be format-0. All format-1 related extension will not be available in this case. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
If the AIV facility is available, a GISA will be used to manage emulated adapter interrupts. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
The function returns a pending I/O interrupt with the highest priority defined by its ISC. Together with AIV activation, pending adapter interrupts are managed by the GISA IPM. Thus kvm_s390_get_io_int() needs to inspect the IPM as well when the interrupt with the highest priority has to be identified. In case classic and adapter interrupts with the same ISC are pending, the classic interrupt will be returned first. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
Pending interrupts marked in the GISA IPM are required to become part of the answer of ioctl KVM_DEV_FLIC_GET_ALL_IRQS. The ioctl KVM_DEV_FLIC_ENQUEUE is already capable to enqueue adapter interrupts when a GISA is present. With ioctl KVM_DEV_FLIC_CLEAR_IRQS the GISA IPM wil be cleared now as well. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
The function isc_to_int_word() allows the generation of interruption words for adapter interrupts. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
The adapter interruption virtualization (AIV) facility is an optional facility that comes with functionality expected to increase the performance of adapter interrupt handling for both emulated and passed-through adapter interrupts. With AIV, adapter interrupts can be delivered to the guest without exiting SIE. This patch provides some preparations for using AIV for emulated adapter interrupts (including virtio) if it's available. When using AIV, the interrupts are delivered at the so called GISA by setting the bit corresponding to its Interruption Subclass (ISC) in the Interruption Pending Mask (IPM) instead of inserting a node into the floating interrupt list. To keep the change reasonably small, the handling of this new state is deferred in get_all_floating_irqs and handle_tpi. This patch concentrates on the code handling enqueuement of emulated adapter interrupts, and their delivery to the guest. Note that care is still required for adapter interrupts using AIV, because there is no guarantee that AIV is going to deliver the adapter interrupts pending at the GISA (consider all vcpus idle). When delivering GISA adapter interrupts by the host (usual mechanism) special attention is required to honor interrupt priorities. Empirical results show that the time window between making an interrupt pending at the GISA and doing kvm_s390_deliver_pending_interrupts is sufficient for a guest with at least moderate cpu activity to get adapter interrupts delivered within the SIE, and potentially save some SIE exits (if not other deliverable interrupts). The code will be activated with a follow-up patch. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
The patch adds an indication for the presence Adapter Interruption Virtualization facility (AIV) of the general channel subsystem characteristics. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> [change wording]
-
Michael Mueller authored
The patch implements routines to access the GISA to test and modify its Interruption Pending Mask (IPM) from the host side. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Jens Freimann authored
This patch adds a MSB0 bit numbering version of test_and_clear_bit(). Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
In preperation to support pass-through adapter interrupts, the Guest Interruption State Area (GISA) and the Adapter Interruption Virtualization (AIV) features will be introduced here. This patch introduces format-0 GISA (that is defines the struct describing the GISA, allocates storage for it, and introduces fields for the GISA address in kvm_s390_sie_block and kvm_s390_vsie). As the GISA requires storage below 2GB, it is put in sie_page2, which is already allocated in ZONE_DMA. In addition, The GISA requires alignment to its integral boundary. This is already naturally aligned via the padding in the sie_page2. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Michael Mueller authored
This patch prepares a simplification of bit operations between the irq pending mask for emulated interrupts and the Interruption Pending Mask (IPM) which is part of the Guest Interruption State Area (GISA), a feature that allows interrupt delivery to guests by means of the SIE instruction. Without that change, a bit-wise *or* operation on parts of these two masks would either require a look-up table of size 256 bytes to map the IPM to the emulated irq pending mask bit orientation (all bits mirrored at half byte) or a sequence of up to 8 condidional branches to perform tests of single bit positions. Both options are to be rejected either by performance or space utilization reasons. Beyond that this change will be transparent. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
- 24 Jan, 2018 9 commits
-
-
David Hildenbrand authored
Use it just like kvm_s390_set_cpuflags() and kvm_s390_clear_cpuflags(). Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180123170531.13687-5-david@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
David Hildenbrand authored
Use it just like kvm_s390_set_cpuflags(). Suggested-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180123170531.13687-4-david@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
David Hildenbrand authored
Use it in all places where we set cpuflags. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180123170531.13687-3-david@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
David Hildenbrand authored
No need to make this function special. Move it to a header right away. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180123170531.13687-2-david@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Christian Borntraeger authored
The overall instruction counter is larger than the sum of the single counters. We should try to catch all instruction handlers to make this match the summary counter. Let us add sck,tb,sske,iske,rrbe,tb,tpi,tsch,lpsw,pswe.... and remove other unused ones. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Janosch Frank <frankja@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com>
-
Christian Borntraeger authored
Make the diagnose counters also appear as instruction counters. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com>
-
David Hildenbrand authored
We never call it with anything but PROT_READ. This is a left over from an old prototype. For creation of shadow page tables, we always only have to protect the original table in guest memory from write accesses, so we can properly invalidate the shadow on writes. Other protections are not needed. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180123212618.32611-1-david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
David Hildenbrand authored
This way, the values cannot change, even if another VCPU might try to mess with the nested SCB currently getting executed by another VCPU. We now always use the same gpa for pinning and unpinning a page (for unpinning, it is only relevant to mark the guest page dirty for migration). Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180116171526.12343-3-david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
David Hildenbrand authored
Another VCPU might try to modify the SCB while we are creating the shadow SCB. In general this is no problem - unless the compiler decides to not load values once, but e.g. twice. For us, this is only relevant when checking/working with such values. E.g. the prefix value, the mso, state of transactional execution and addresses of satellite blocks. E.g. if we blindly forward values (e.g. general purpose registers or execution controls after masking), we don't care. Leaving unpin_blocks() untouched for now, will handle it separately. The worst thing right now that I can see would be a missed prefix un/remap (mso, prefix, tx) or using wrong guest addresses. Nothing critical, but let's try to avoid unpredictable behavior. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180116171526.12343-2-david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
- 23 Jan, 2018 1 commit
-
-
Janosch Frank authored
It seems it hasn't even been used before the last cleanup and was overlooked. Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com> Message-Id: <1513169613-13509-12-git-send-email-frankja@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
- 16 Jan, 2018 16 commits
-
-
Paolo Bonzini authored
Remove duplicate expression in nested_vmx_prepare_msr_bitmap, and make the register names clearer in hardware_setup. Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [Resolved rebase conflict after removing Intel PT. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
The bulk of the MSR bitmap is either immutable, or can be copied from the L1 bitmap. By initializing it at VMXON time, and copying the mutable parts one long at a time on vmentry (rather than one bit), about 4000 clock cycles (30%) can be saved on a nested VMLAUNCH/VMRESUME. The resulting for loop only has four iterations, so it is cheap enough to reinitialize the MSR write bitmaps on every iteration, and it makes the code simpler. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
The APICv-enabled MSR bitmap is a superset of the APICv-disabled bitmap. Make that obvious in vmx_disable_intercept_msr_x2apic. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [Resolved rebase conflict after removing Intel PT. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
The POSTED_INTR_NV field is constant (though it differs between the vmcs01 and vmcs02), there is no need to reload it on vmexit to L1. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
These fields are also simple copies of the data in the vmcs12 struct. For some of them, prepare_vmcs02 was skipping the copy when the field was unused. In prepare_vmcs02_full, we copy them always as long as the field exists on the host, because the corresponding execution control might be one of the shadowed fields. Optimization opportunities remain for MSRs that, depending on the entry/exit controls, have to be copied from either the vmcs01 or the vmcs12: EFER (whose value is partly stored in the entry controls too), PAT, DEBUGCTL (and also DR7). Before moving these three and the entry/exit controls to prepare_vmcs02_full, KVM would have to set dirty_vmcs12 on writes to the L1 MSRs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
This part is separate for ease of review, because git prefers to move prepare_vmcs02 below the initial long sequence of vmcs_write* operations. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
VMCS12 fields that are not handled through shadow VMCS are rarely written, and thus they are also almost constant in the vmcs02. We can thus optimize prepare_vmcs02 by skipping all the work for non-shadowed fields in the common case. This patch introduces the (pretty simple) tracking infrastructure; the next patches will move work to prepare_vmcs02_full and save a few hundred clock cycles per VMRESUME on a Haswell Xeon E5 system: before after cpuid 14159 13869 vmcall 15290 14951 inl_from_kernel 17703 17447 outl_to_kernel 16011 14692 self_ipi_sti_nop 16763 15825 self_ipi_tpr_sti_nop 17341 15935 wr_tsc_adjust_msr 14510 14264 rd_tsc_adjust_msr 15018 14311 mmio-wildcard-eventfd:pci-mem 16381 14947 mmio-datamatch-eventfd:pci-mem 18620 17858 portio-wildcard-eventfd:pci-io 15121 14769 portio-datamatch-eventfd:pci-io 15761 14831 (average savings 748, stdev 460). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
Prepare for multiple inclusions of the list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Jim Mattson authored
The vmcs_field_to_offset_table was a rather sparse table of short integers with a maximum index of 0x6c16, amounting to 55342 bytes. Now that we are considering support for multiple VMCS12 formats, it would be unfortunate to replicate that large, sparse table. Rotating the field encoding (as a 16-bit integer) left by 6 reduces that table to 5926 bytes. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Jim Mattson authored
Per the SDM, "[VMCS] Fields are grouped by width (16-bit, 32-bit, etc.) and type (guest-state, host-state, etc.)." Previously, the width was indicated by vmcs_field_type. To avoid confusion when we start dealing with both field width and field type, change vmcs_field_type to vmcs_field_width. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Jim Mattson authored
This is the highest index value used in any supported VMCS12 field encoding. It is used to populate the IA32_VMX_VMCS_ENUM MSR. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
Because all fields can be read/written with a single vmread/vmwrite on 64-bit kernels, the switch statements in copy_vmcs12_to_shadow and copy_shadow_to_vmcs12 are unnecessary. What I did in this patch is to copy the two parts of 64-bit fields separately on 32-bit kernels, to keep all complicated #ifdef-ery in init_vmcs_shadow_fields. The disadvantage is that 64-bit fields have to be listed separately in shadow_read_only/read_write_fields, but those are few and we can validate the arrays when building the VMREAD and VMWRITE bitmaps. This saves a few hundred clock cycles per nested vmexit. However there is still a "switch" in vmcs_read_any and vmcs_write_any. So, while at it, this patch reorders the fields by type, hoping that the branch predictor appreciates it. Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Paolo Bonzini authored
Compared to when VMCS shadowing was added to KVM, we are reading/writing a few more fields: the PML index, the interrupt status and the preemption timer value. The first two are because we are exposing more features to nested guests, the preemption timer is simply because we have grown a new optimization. Adding them to the shadow VMCS field lists reduces the cost of a vmexit by about 1000 clock cycles for each field that exists on bare metal. On the other hand, the guest BNDCFGS and TSC offset are not written on fast paths, so remove them. Suggested-by: Jim Mattson <jmattson@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linuxRadim Krčmář authored
KVM: s390: Fixes and features for 4.16 - add the virtio-ccw transport for kvmconfig - more debug tracing for cpu model - cleanups and fixes
-
Liran Alon authored
Consider the following scenario: 1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI to CPU B via virtual posted-interrupt mechanism. 2. CPU B is currently executing L2 guest. 3. vmx_deliver_nested_posted_interrupt() calls kvm_vcpu_trigger_posted_interrupt() which will note that vcpu->mode == IN_GUEST_MODE. 4. Assume that before CPU A sends the physical POSTED_INTR_NESTED_VECTOR IPI, CPU B exits from L2 to L0 during event-delivery (valid IDT-vectoring-info). 5. CPU A now sends the physical IPI. The IPI is received in host and it's handler (smp_kvm_posted_intr_nested_ipi()) does nothing. 6. Assume that before CPU A sets pi_pending=true and KVM_REQ_EVENT, CPU B continues to run in L0 and reach vcpu_enter_guest(). As KVM_REQ_EVENT is not set yet, vcpu_enter_guest() will continue and resume L2 guest. 7. At this point, CPU A sets pi_pending=true and KVM_REQ_EVENT but it's too late! CPU B already entered L2 and KVM_REQ_EVENT will only be consumed at next L2 entry! Another scenario to consider: 1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI to CPU B via virtual posted-interrupt mechanism. 2. Assume that before CPU A calls kvm_vcpu_trigger_posted_interrupt(), CPU B is at L0 and is about to resume into L2. Further assume that it is in vcpu_enter_guest() after check for KVM_REQ_EVENT. 3. At this point, CPU A calls kvm_vcpu_trigger_posted_interrupt() which will note that vcpu->mode != IN_GUEST_MODE. Therefore, do nothing and return false. Then, will set pi_pending=true and KVM_REQ_EVENT. 4. Now CPU B continue and resumes into L2 guest without processing the posted-interrupt until next L2 entry! To fix both issues, we just need to change vmx_deliver_nested_posted_interrupt() to set pi_pending=true and KVM_REQ_EVENT before calling kvm_vcpu_trigger_posted_interrupt(). It will fix the first scenario by chaging step (6) to note that KVM_REQ_EVENT and pi_pending=true and therefore process nested posted-interrupt. It will fix the second scenario by two possible ways: 1. If kvm_vcpu_trigger_posted_interrupt() is called while CPU B has changed vcpu->mode to IN_GUEST_MODE, physical IPI will be sent and will be received when CPU resumes into L2. 2. If kvm_vcpu_trigger_posted_interrupt() is called while CPU B hasn't yet changed vcpu->mode to IN_GUEST_MODE, then after CPU B will change vcpu->mode it will call kvm_request_pending() which will return true and therefore force another round of vcpu_enter_guest() which will note that KVM_REQ_EVENT and pi_pending=true and therefore process nested posted-interrupt. Cc: stable@vger.kernel.org Fixes: 705699a1 ("KVM: nVMX: Enable nested posted interrupt processing") Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> [Add kvm_vcpu_kick to also handle the case where L1 doesn't intercept L2 HLT and L2 executes HLT instruction. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Liran Alon authored
Before each vmentry to guest, vcpu_enter_guest() calls sync_pir_to_irr() which calls vmx_hwapic_irr_update() to update RVI. Currently, vmx_hwapic_irr_update() contains a tweak in case it is called when CPU is running L2 and L1 don't intercept external-interrupts. In that case, code injects interrupt directly into L2 instead of updating RVI. Besides being hacky (wouldn't expect function updating RVI to also inject interrupt), it also doesn't handle this case correctly. The code contains several issues: 1. When code calls kvm_queue_interrupt() it just passes it max_irr which represents the highest IRR currently pending in L1 LAPIC. This is problematic as interrupt was injected to guest but it's bit is still set in LAPIC IRR instead of being cleared from IRR and set in ISR. 2. Code doesn't check if LAPIC PPR is set to accept an interrupt of max_irr priority. It just checks if interrupts are enabled in guest with vmx_interrupt_allowed(). To fix the above issues: 1. Simplify vmx_hwapic_irr_update() to just update RVI. Note that this shouldn't happen when CPU is running L2 (See comment in code). 2. Since now vmx_hwapic_irr_update() only does logic for L1 virtual-interrupt-delivery, inject_pending_event() should be the one responsible for injecting the interrupt directly into L2. Therefore, change kvm_cpu_has_injectable_intr() to check L1 LAPIC when CPU is running L2. 3. Change vmx_sync_pir_to_irr() to set KVM_REQ_EVENT when L1 has a pending injectable interrupt. Fixes: 963fee16 ("KVM: nVMX: Fix virtual interrupt delivery injection") Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-