1. 03 Sep, 2014 5 commits
    • Paolo Bonzini's avatar
      KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD · a0c0feb5
      Paolo Bonzini authored
      Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
      page table entries.  Intel ignores it; AMD ignores it in PDEs, but reserves it
      in PDPEs and PML4Es.  The SVM test is relying on this behavior, so enforce it.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a0c0feb5
    • Tiejun Chen's avatar
      KVM: mmio: cleanup kvm_set_mmio_spte_mask · d1431483
      Tiejun Chen authored
      Just reuse rsvd_bits() inside kvm_set_mmio_spte_mask()
      for slightly better code.
      Signed-off-by: default avatarTiejun Chen <tiejun.chen@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d1431483
    • David Matlack's avatar
      kvm: x86: fix stale mmio cache bug · 56f17dd3
      David Matlack authored
      The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
      up to userspace:
      
      (1) Guest accesses gpa X without a memory slot. The gfn is cached in
      struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
      the SPTE write-execute-noread so that future accesses cause
      EPT_MISCONFIGs.
      
      (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
      covering the page just accessed.
      
      (3) Guest attempts to read or write to gpa X again. On Intel, this
      generates an EPT_MISCONFIG. The memory slot generation number that
      was incremented in (2) would normally take care of this but we fast
      path mmio faults through quickly_check_mmio_pf(), which only checks
      the per-vcpu mmio cache. Since we hit the cache, KVM passes a
      KVM_EXIT_MMIO up to userspace.
      
      This patch fixes the issue by using the memslot generation number
      to validate the mmio cache.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Reviewed-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Tested-by: default avatarDavid Matlack <dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      56f17dd3
    • David Matlack's avatar
      kvm: fix potentially corrupt mmio cache · ee3d1570
      David Matlack authored
      vcpu exits and memslot mutations can run concurrently as long as the
      vcpu does not aquire the slots mutex. Thus it is theoretically possible
      for memslots to change underneath a vcpu that is handling an exit.
      
      If we increment the memslot generation number again after
      synchronize_srcu_expedited(), vcpus can safely cache memslot generation
      without maintaining a single rcu_dereference through an entire vm exit.
      And much of the x86/kvm code does not maintain a single rcu_dereference
      of the current memslots during each exit.
      
      We can prevent the following case:
      
         vcpu (CPU 0)                             | thread (CPU 1)
      --------------------------------------------+--------------------------
      1  vm exit                                  |
      2  srcu_read_unlock(&kvm->srcu)             |
      3  decide to cache something based on       |
           old memslots                           |
      4                                           | change memslots
                                                  | (increments generation)
      5                                           | synchronize_srcu(&kvm->srcu);
      6  retrieve generation # from new memslots  |
      7  tag cache with new memslot generation    |
      8  srcu_read_unlock(&kvm->srcu)             |
      ...                                         |
         <action based on cache occurs even       |
          though the caching decision was based   |
          on the old memslots>                    |
      ...                                         |
         <action *continues* to occur until next  |
          memslot generation change, which may    |
          be never>                               |
                                                  |
      
      By incrementing the generation after synchronizing with kvm->srcu readers,
      we ensure that the generation retrieved in (6) will become invalid soon
      after (8).
      
      Keeping the existing increment is not strictly necessary, but we
      do keep it and just move it for consistency from update_memslots to
      install_new_memslots.  It invalidates old cached MMIOs immediately,
      instead of having to wait for the end of synchronize_srcu_expedited,
      which makes the code more clearly correct in case CPU 1 is preempted
      right after synchronize_srcu() returns.
      
      To avoid halving the generation space in SPTEs, always presume that the
      low bit of the generation is zero when reconstructing a generation number
      out of an SPTE.  This effectively disables MMIO caching in SPTEs during
      the call to synchronize_srcu_expedited.  Using the low bit this way is
      somewhat like a seqcount---where the protected thing is a cache, and
      instead of retrying we can simply punt if we observe the low bit to be 1.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Reviewed-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ee3d1570
    • Paolo Bonzini's avatar
      KVM: do not bias the generation number in kvm_current_mmio_generation · 00f034a1
      Paolo Bonzini authored
      The next patch will give a meaning (a la seqcount) to the low bit of the
      generation number.  Ensure that it matches between kvm->memslots->generation
      and kvm_current_mmio_generation().
      
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Reviewed-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      00f034a1
  2. 29 Aug, 2014 13 commits
  3. 26 Aug, 2014 4 commits
    • Paolo Bonzini's avatar
      Merge tag 'kvm-s390-next-20140825' of... · a7428c3d
      Paolo Bonzini authored
      Merge tag 'kvm-s390-next-20140825' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      KVM: s390: Fixes and features for 3.18 part 1
      
      1. The usual cleanups: get rid of duplicate code, use defines, factor
         out the sync_reg handling, additional docs for sync_regs, better
         error handling on interrupt injection
      2. We use KVM_REQ_TLB_FLUSH instead of open coding tlb flushes
      3. Additional registers for kvm_run sync regs. This is usually not
         needed in the fast path due to eventfd/irqfd, but kvm stat claims
         that we reduced the overhead of console output by ~50% on my system
      4. A rework of the gmap infrastructure. This is the 2nd step towards
         host large page support (after getting rid of the storage key
         dependency). We introduces two radix trees to store the guest-to-host
         and host-to-guest translations. This gets us rid of most of
         the page-table walks in the gmap code. Only one in __gmap_link is left,
         this one is required to link the shadow page table to the process page
         table. Finally this contains the plumbing to support gmap page tables
         with less than 5 levels.
      a7428c3d
    • Martin Schwidefsky's avatar
      KVM: s390/mm: remove outdated gmap data structures · f079e952
      Martin Schwidefsky authored
      The radix tree rework removed all code that uses the gmap_rmap
      and gmap_pgtable data structures. Remove these outdated definitions.
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      f079e952
    • Martin Schwidefsky's avatar
      KVM: s390/mm: support gmap page tables with less than 5 levels · c6c956b8
      Martin Schwidefsky authored
      Add an addressing limit to the gmap address spaces and only allocate
      the page table levels that are needed for the given limit. The limit
      is fixed and can not be changed after a gmap has been created.
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      c6c956b8
    • Martin Schwidefsky's avatar
      KVM: s390/mm: use radix trees for guest to host mappings · 527e30b4
      Martin Schwidefsky authored
      Store the target address for the gmap segments in a radix tree
      instead of using invalid segment table entries. gmap_translate
      becomes a simple radix_tree_lookup, gmap_fault is split into the
      address translation with gmap_translate and the part that does
      the linking of the gmap shadow page table with the process page
      table.
      A second radix tree is used to keep the pointers to the segment
      table entries for segments that are mapped in the guest address
      space. On unmap of a segment the pointer is retrieved from the
      radix tree and is used to carry out the segment invalidation in
      the gmap shadow page table. As the radix tree can only store one
      pointer, each host segment may only be mapped to exactly one
      guest location.
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      527e30b4
  4. 25 Aug, 2014 16 commits
  5. 21 Aug, 2014 2 commits
    • Radim Krčmář's avatar
      KVM: trace kvm_ple_window grow/shrink · 7b46268d
      Radim Krčmář authored
      Tracepoint for dynamic PLE window, fired on every potential change.
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7b46268d
    • Radim Krčmář's avatar
      KVM: VMX: dynamise PLE window · b4a2d31d
      Radim Krčmář authored
      Window is increased on every PLE exit and decreased on every sched_in.
      The idea is that we don't want to PLE exit if there is no preemption
      going on.
      We do this with sched_in() because it does not hold rq lock.
      
      There are two new kernel parameters for changing the window:
       ple_window_grow and ple_window_shrink
      ple_window_grow affects the window on PLE exit and ple_window_shrink
      does it on sched_in;  depending on their value, the window is modifier
      like this: (ple_window is kvm_intel's global)
      
        ple_window_shrink/ |
        ple_window_grow    | PLE exit           | sched_in
        -------------------+--------------------+---------------------
        < 1                |  = ple_window      |  = ple_window
        < ple_window       | *= ple_window_grow | /= ple_window_shrink
        otherwise          | += ple_window_grow | -= ple_window_shrink
      
      A third new parameter, ple_window_max, controls the maximal ple_window;
      it is internally rounded down to a closest multiple of ple_window_grow.
      
      VCPU's PLE window is never allowed below ple_window.
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b4a2d31d