    x86/kvm: Sanitize kvm_async_pf_task_wait() · 6bca69ad
    Thomas Gleixner authored
    While working on the entry consolidation I stumbled over the KVM async page
    fault handler and kvm_async_pf_task_wait() in particular. It took me a
    while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
    invocations are just cargo cult programming. Several patches "fixed" RCU
    splats by curing the symptoms without noticing that the code is flawed 
    from a design perspective.
    
    The main problem is that this async injection is not based on a proper
    handshake mechanism and only respects the minimal requirement, i.e. the
    guest is not in a state where it has interrupts disabled.
    
    Aside from that, the actual code is a convoluted one-size-fits-all Swiss
    army knife. It is invoked from different places with different RCU
    constraints:
    
      1) Host side:
    
         vcpu_enter_guest()
           kvm_x86_ops->handle_exit()
             kvm_handle_page_fault()
               kvm_async_pf_task_wait()
    
         The invocation happens from fully preemptible context.
    
      2) Guest side:
    
         The async page fault interrupted:
    
             a) user space
    
             b) preemptible kernel code which is not in an RCU read side
                critical section

             c) non-preemptible kernel code or an RCU read side critical
                section, or kernel code with CONFIG_PREEMPTION=n, which does
                not allow differentiating between #2b and #2c.
    
    RCU is watching for:
    
      #1  The vCPU exited and current is definitely not the idle task
    
      #2a The #PF entry code on the guest went through enter_from_user_mode()
          which reactivates RCU
    
      #2b There is no preemptible, interrupts-enabled code in the kernel
          which can run with RCU looking away. (The idle task is always
          non-preemptible).
    
    I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
    voodoo at all.
    
    In #2c RCU might not be watching. As that state cannot schedule anyway,
    there is no point in worrying about it, so it has to invoke
    rcu_irq_enter() before running that code. This can be optimized, but that
    will be done as an extra step in the course of the entry code
    consolidation work.
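
    The case analysis above can be summarized as a small user-space model.
    This is only a sketch: the enum labels and helper names are hypothetical
    and are not part of the kernel code touched by this change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the invocation contexts listed above. */
enum apf_ctx {
	APF_CTX_HOST,		/* #1:  host side, fully preemptible */
	APF_CTX_GUEST_USER,	/* #2a: fault interrupted guest user space */
	APF_CTX_GUEST_PREEMPT,	/* #2b: preemptible kernel, no RCU read side */
	APF_CTX_GUEST_ATOMIC,	/* #2c: non-preemptible or RCU read side;
				 *      with CONFIG_PREEMPTION=n, #2b cannot be
				 *      told apart and ends up here as well */
};

/* Schedulable states are exactly those where RCU is known to be watching. */
bool apf_ctx_can_schedule(enum apf_ctx ctx)
{
	switch (ctx) {
	case APF_CTX_HOST:
	case APF_CTX_GUEST_USER:
	case APF_CTX_GUEST_PREEMPT:
		return true;
	case APF_CTX_GUEST_ATOMIC:
		return false;
	}
	assert(!"unknown context");
	return false;
}

/* Only #2c may run with RCU not watching, so only it needs rcu_irq_enter(). */
bool apf_ctx_needs_rcu_irq_enter(enum apf_ctx ctx)
{
	return !apf_ctx_can_schedule(ctx);
}
```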
    
    So the proper solution for this is to:
    
      - Split kvm_async_pf_task_wait() into schedule and halt based waiting
        interfaces which share the enqueueing code.
    
      - Add comments (condensed form of this changelog) to spare others the
        wasted time and pain of reverse engineering all of this with the help
        of incomprehensible changelogs and code history.
    
      - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault()
        and for user mode and schedulable kernel side async page faults
        (#1, #2a, #2b)
    
      - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
        case (#2c).
    
        For this case also remove the rcu_irq_exit()/enter() pair around the
        halt as it is just a pointless exercise:
    
           - vCPUs can VMEXIT at any random point and can be scheduled out for
             an arbitrary amount of time by the host, and this is not any
             different except that it voluntarily triggers the exit via halt.

           - The interrupted context could have RCU watching already. So the
             rcu_irq_exit() before the halt is not gaining anything aside from
             confusing the reader. Claiming that this might prevent RCU stalls
             is just an illusion.
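
    The shape of the split can be illustrated with a user-space toy model. It
    is a sketch under stated assumptions, not the actual implementation: all
    apf_model_* names, the fixed-size table and the token lookup are
    hypothetical, and the real waiting (schedule() or halt) is reduced to a
    comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define APF_MODEL_MAX_WAITERS 8

enum apf_model_mode { APF_MODEL_SCHEDULE, APF_MODEL_HALT };

struct apf_model_waiter {
	unsigned int token;		/* identifies the pending page fault */
	enum apf_model_mode mode;	/* how the waiter would wait */
	bool in_use;
};

static struct apf_model_waiter apf_model_table[APF_MODEL_MAX_WAITERS];

/* Look up the waiter registered for a token, or NULL if there is none. */
struct apf_model_waiter *apf_model_find(unsigned int token)
{
	for (size_t i = 0; i < APF_MODEL_MAX_WAITERS; i++) {
		struct apf_model_waiter *w = &apf_model_table[i];

		if (w->in_use && w->token == token)
			return w;
	}
	return NULL;
}

/* Shared enqueue step used by both waiting interfaces. */
static bool apf_model_enqueue(unsigned int token, enum apf_model_mode mode)
{
	/* This model allows only one waiter per token. */
	assert(apf_model_find(token) == NULL);

	for (size_t i = 0; i < APF_MODEL_MAX_WAITERS; i++) {
		struct apf_model_waiter *w = &apf_model_table[i];

		if (!w->in_use) {
			w->token = token;
			w->mode = mode;
			w->in_use = true;
			return true;
		}
	}
	return false;	/* table full */
}

/* Schedulable contexts (#1, #2a, #2b): real code would sleep via schedule(). */
bool apf_model_wait_schedule(unsigned int token)
{
	return apf_model_enqueue(token, APF_MODEL_SCHEDULE);
}

/* Non-schedulable context (#2c): real code would halt the vCPU instead. */
bool apf_model_wait_halt(unsigned int token)
{
	return apf_model_enqueue(token, APF_MODEL_HALT);
}

/* "Page ready" notification: drop the waiter; false if none was queued. */
bool apf_model_wake(unsigned int token)
{
	struct apf_model_waiter *w = apf_model_find(token);

	if (!w)
		return false;
	w->in_use = false;
	return true;
}
```

    Both entry points differ only in the mode they record, which mirrors the
    point of the split: the enqueueing is common, only the way of waiting
    depends on whether the interrupted context can schedule.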

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
    Acked-by: Paolo Bonzini <pbonzini@redhat.com>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
    