1. 09 Dec, 2013 4 commits
    • KVM: PPC: Book3S: PR: Enable interrupts earlier · 3d3319b4
      Alexander Graf authored
      Now that the svcpu sync is interrupt aware we can enable interrupts
      earlier in the exit code path again, moving 32bit and 64bit closer
      together.
      
      While at it, document the fact that we're always executing the exit
      path with interrupts enabled so that the next person doesn't trip
      over this.
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S: PR: Make svcpu -> vcpu store preempt savvy · 40fdd8c8
      Alexander Graf authored
      As soon as we get back to our "highmem" handler in virtual address
      space we may get preempted. Today the reason we can get preempted is
      that we replay interrupts and all the lazy logic thinks we have
      interrupts enabled.
      
      However, it's not hard to make the code interruptible and that way
      we can enable and handle interrupts even earlier.
      
      This fixes random guest crashes that happened with CONFIG_PREEMPT=y
      for me.
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S: PR: Export kvmppc_copy_to|from_svcpu · c9dad7f9
      Alexander Graf authored
      The kvmppc_copy_{to,from}_svcpu functions are publicly visible,
      so we should also export them in a header for other C files to
      consume.
      
      So far we didn't need this because we only called them from asm
      code.  The next patch will introduce a C caller.
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S: PR: Don't clobber our exit handler id · d825a043
      Alexander Graf authored
      We call a C helper to save all svcpu fields into our vcpu. The C
      ABI states that r12 is considered volatile. However, we keep our
      exit handler id in r12 currently.
      
      So we need to save it away into a non-volatile register instead,
      one that is guaranteed to be preserved across the C call.
      
      This bug hasn't actually hit anyone yet, because gcc happens to
      generate code that doesn't use r12 at all, which means the value
      survived the call by sheer luck. But we can't rely on that.
      Signed-off-by: Alexander Graf <agraf@suse.de>
  2. 18 Nov, 2013 5 commits
    • powerpc: kvm: fix rare but potential deadlock scene · 91648ec0
      pingfan liu authored
      kvmppc_hv_find_lock_hpte() is called from both virtmode and
      realmode, so it can trigger a deadlock.
      
      Consider the following scenario:
      
      Two physical CPUs, cpuM and cpuN, and two VM instances, A and B,
      each with a group of vcpus.
      
      If vcpu_A_1 on cpuM holds bitlock X (HPTE_V_HVLOCK) and is then
      switched out, and vcpu_A_2 on cpuN tries to take X in realmode,
      cpuN will be stuck spinning in realmode for a long time.
      
      Things get even worse if the following happens:
        on cpuM, bitlock X is held; on cpuN, bitlock Y is held;
        vcpu_B_2 tries to take Y on cpuM in realmode;
        vcpu_A_2 tries to take X on cpuN in realmode.
      
      Deadlock.
      Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      CC: stable@vger.kernel.org
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call · c9438092
      Paul Mackerras authored
      Running a kernel with CONFIG_PROVE_RCU=y yields the following diagnostic:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      3.12.0-rc5-kvm+ #9 Not tainted
      -------------------------------
      
      include/linux/kvm_host.h:473 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      1 lock held by qemu-system-ppc/4831:
      
      stack backtrace:
      CPU: 28 PID: 4831 Comm: qemu-system-ppc Not tainted 3.12.0-rc5-kvm+ #9
      Call Trace:
      [c000000be462b2a0] [c00000000001644c] .show_stack+0x7c/0x1f0 (unreliable)
      [c000000be462b370] [c000000000ad57c0] .dump_stack+0x88/0xb4
      [c000000be462b3f0] [c0000000001315e8] .lockdep_rcu_suspicious+0x138/0x180
      [c000000be462b480] [c00000000007862c] .gfn_to_memslot+0x13c/0x170
      [c000000be462b510] [c00000000007d384] .gfn_to_hva_prot+0x24/0x90
      [c000000be462b5a0] [c00000000007d420] .kvm_read_guest_page+0x30/0xd0
      [c000000be462b630] [c00000000007d528] .kvm_read_guest+0x68/0x110
      [c000000be462b6e0] [c000000000084594] .kvmppc_rtas_hcall+0x34/0x180
      [c000000be462b7d0] [c000000000097934] .kvmppc_pseries_do_hcall+0x74/0x830
      [c000000be462b880] [c0000000000990e8] .kvmppc_vcpu_run_hv+0xff8/0x15a0
      [c000000be462b9e0] [c0000000000839cc] .kvmppc_vcpu_run+0x2c/0x40
      [c000000be462ba50] [c0000000000810b4] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
      [c000000be462bae0] [c00000000007b508] .kvm_vcpu_ioctl+0x478/0x730
      [c000000be462bca0] [c00000000025532c] .do_vfs_ioctl+0x4dc/0x7a0
      [c000000be462bd80] [c0000000002556b4] .SyS_ioctl+0xc4/0xe0
      [c000000be462be30] [c000000000009ee4] syscall_exit+0x0/0x98
      
      To fix this, we take the SRCU read lock around the kvmppc_rtas_hcall()
      call.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Make tbacct_lock irq-safe · bf3d32e1
      Paul Mackerras authored
      Lockdep reported that there is a potential for deadlock because
      vcpu->arch.tbacct_lock is not irq-safe, and is sometimes taken inside
      the rq_lock (run-queue lock) in the scheduler, which is taken within
      interrupts.  The lockdep splat looks like:
      
      ======================================================
      [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
      3.12.0-rc5-kvm+ #8 Not tainted
      ------------------------------------------------------
      qemu-system-ppc/4803 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}, at: [<c0000000000947ac>] .kvmppc_core_vcpu_put_hv+0x2c/0xa0
      
      and this task is already holding:
      (&rq->lock){-.-.-.}, at: [<c000000000ac16c0>] .__schedule+0x180/0xaa0
      which would create a new lock dependency:
      (&rq->lock){-.-.-.} -> (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}
      
      but this new dependency connects a HARDIRQ-irq-safe lock:
      (&rq->lock){-.-.-.}
      ... which became HARDIRQ-irq-safe at:
       [<c00000000013797c>] .lock_acquire+0xbc/0x190
       [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
       [<c0000000000f8564>] .scheduler_tick+0x54/0x180
       [<c0000000000c2610>] .update_process_times+0x70/0xa0
       [<c00000000012cdfc>] .tick_periodic+0x3c/0xe0
       [<c00000000012cec8>] .tick_handle_periodic+0x28/0xb0
       [<c00000000001ef40>] .timer_interrupt+0x120/0x2e0
       [<c000000000002868>] decrementer_common+0x168/0x180
       [<c0000000001c7ca4>] .get_page_from_freelist+0x924/0xc10
       [<c0000000001c8e00>] .__alloc_pages_nodemask+0x200/0xba0
       [<c0000000001c9eb8>] .alloc_pages_exact_nid+0x68/0x110
       [<c000000000f4c3ec>] .page_cgroup_init+0x1e0/0x270
       [<c000000000f24480>] .start_kernel+0x3e0/0x4e4
       [<c000000000009d30>] .start_here_common+0x20/0x70
      
      to a HARDIRQ-irq-unsafe lock:
      (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}
      ... which became HARDIRQ-irq-unsafe at:
      ...  [<c00000000013797c>] .lock_acquire+0xbc/0x190
       [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
       [<c0000000000946ac>] .kvmppc_core_vcpu_load_hv+0x2c/0x100
       [<c00000000008394c>] .kvmppc_core_vcpu_load+0x2c/0x40
       [<c000000000081000>] .kvm_arch_vcpu_load+0x10/0x30
       [<c00000000007afd4>] .vcpu_load+0x64/0xd0
       [<c00000000007b0f8>] .kvm_vcpu_ioctl+0x68/0x730
       [<c00000000025530c>] .do_vfs_ioctl+0x4dc/0x7a0
       [<c000000000255694>] .SyS_ioctl+0xc4/0xe0
       [<c000000000009ee4>] syscall_exit+0x0/0x98
      
      Some users have reported this deadlock occurring in practice, though
      the reports have been primarily on 3.10.x-based kernels.
      
      This fixes the problem by making tbacct_lock be irq-safe.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Refine barriers in guest entry/exit · f019b7ad
      Paul Mackerras authored
      Some users have reported instances of the host hanging with secondary
      threads of a core waiting for the primary thread to exit the guest,
      and the primary thread stuck in nap mode.  This prompted a review of
      the memory barriers in the guest entry/exit code, and this is the
      result.  Most of these changes are the suggestions of Dean Burdick
      <deanburdick@us.ibm.com>.
      
      The barriers between updating napping_threads and reading the
      entry_exit_count on the one hand, and updating entry_exit_count and
      reading napping_threads on the other, need to be isync not lwsync,
      since we need to ensure that either the napping_threads update or the
      entry_exit_count update get seen.  It is not sufficient to order the
      load vs. lwarx, as lwsync does; we need to order the load vs. the
      stwcx., so we need isync.
      
      In addition, we need a full sync before sending IPIs to wake other
      threads from nap, to ensure that the write to the entry_exit_count is
      visible before the IPI occurs.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Fix physical address calculations · caaa4c80
      Paul Mackerras authored
      This fixes a bug in kvmppc_do_h_enter() where the physical address
      for a page can be calculated incorrectly if transparent huge pages
      (THP) are active.  Until THP came along, it was true that if we
      encountered a large (16M) page in kvmppc_do_h_enter(), then the
      associated memslot must be 16M aligned for both its guest physical
      address and the userspace address, and the physical address
      calculations in kvmppc_do_h_enter() assumed that.  With THP, that
      is no longer true.
      
      In the case where we are using MMU notifiers and the page size that
      we get from the Linux page tables is larger than the page being mapped
      by the guest, we need to fill in some low-order bits of the physical
      address.  Without THP, these bits would be the same in the guest
      physical address (gpa) and the host virtual address (hva).  With THP,
      they can be different, and we need to use the bits from hva rather
      than gpa.
      
      In the case where we are not using MMU notifiers, the host physical
      address we get from the memslot->arch.slot_phys[] array already
      includes the low-order bits down to the PAGE_SIZE level, even if
      we are using large pages.  Thus we can simplify the calculation in
      this case to just add in the remaining bits in the case where
      PAGE_SIZE is 64k and the guest is mapping a 4k page.
      
      The same bug exists in kvmppc_book3s_hv_page_fault().  The basic fix
      is to use psize (the page size from the HPTE) rather than pte_size
      (the page size from the Linux PTE) when updating the HPTE low word
      in r.  That means that pfn needs to be computed to PAGE_SIZE
      granularity even if the Linux PTE is a huge page PTE.  That can be
      arranged simply by doing the page_to_pfn() before setting page to
      the head of the compound page.  If psize is less than PAGE_SIZE,
      then we need to make sure we only update the bits from PAGE_SIZE
      upwards, in order not to lose any sub-page offset bits in r.
      On the other hand, if psize is greater than PAGE_SIZE, we need to
      make sure we don't bring in non-zero low order bits in pfn, hence
      we mask (pfn << PAGE_SHIFT) with ~(psize - 1).
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  3. 15 Nov, 2013 31 commits