• Sean Christopherson's avatar
    KVM: doc: Document the life cycle of a VM and its resources · eca6be56
    Sean Christopherson authored
    The series to add memcg accounting to KVM allocations[1] states:
    
      There are many KVM kernel memory allocations which are tied to the
      life of the VM process and should be charged to the VM process's
      cgroup.
    
    While it is correct to account KVM kernel allocations to the cgroup of
    the process that created the VM, it's technically incorrect to state
    that the KVM kernel memory allocations are tied to the life of the VM
    process.  This is because the VM itself, i.e. struct kvm, is not tied to
    the life of the process which created it, rather it is tied to the life
    of its associated file descriptor.  In other words, kvm_destroy_vm() is
    not invoked until fput() decrements its associated file's refcount to
    zero.  A simple example is to fork() in Qemu and have the child sleep
    indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
    descriptor *and* the rogue child is killed.
    
    The allocations are guaranteed to be *accounted* to the process which
    created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
    ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
    the VM "alive" but can't do anything useful with its reference.
    
    Note that because 'struct kvm' also holds a reference to the mm_struct
    of its owner, the above behavior also applies to userspace allocations.
    
    Given that mucking with a VM's file descriptor can lead to subtle and
    undesirable behavior, e.g. memcg charges persisting after a VM is shut
    down, explicitly document a VM's lifecycle and its impact on the VM's
    resources.
    
    Alternatively, KVM could aggressively free resources when the creating
    process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
    isn't guaranteed to be available, and freeing resources when the creator
    exits is likely to be error prone and fragile as KVM would need to
    ensure that it only freed resources that are truly out of reach. In
    practice, the existing behavior shouldn't be problematic as a properly
    configured system will prevent a child process from being moved out of
    the appropriate cgroup hierarchy, i.e. prevent hiding the process from
    the OOM killer, and will prevent an unprivileged user from being able to
    to hold a reference to struct kvm via another method, e.g. debugfs.
    
    [1]https://patchwork.kernel.org/patch/10806707/Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
    Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    eca6be56
api.txt 169 KB