1. 27 Dec, 2022 34 commits
  2. 23 Dec, 2022 6 commits
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level · 50a9ac25
      Sean Christopherson authored
      Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
      match the target level of the fault, and instead have the vCPU retry the
      faulting instruction after warning.  Continuing on is completely
      unnecessary as the absolute worst case scenario of retrying is DoSing
      the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
      
        ------------[ cut here ]------------
        kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
        invalid opcode: 0000 [#1] SMP
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
        RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
        RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
        RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
        R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
        R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_mmu_map+0x3b0/0x510
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        Modules linked in: kvm_intel
        ---[ end trace 0000000000000000 ]---
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      50a9ac25
    • Sean Christopherson's avatar
      KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed · 21a36ac6
      Sean Christopherson authored
      Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
      when adding a new shadow page in the TDP MMU.  To ensure the NX reclaim
      kthread can't see a not-yet-linked shadow page, the page fault path links
      the new page table prior to adding the page to possible_nx_huge_pages.
      
      If the page is zapped by different task, e.g. because dirty logging is
      disabled, between linking the page and adding it to the list, KVM can end
      up triggering use-after-free by adding the zapped SP to the aforementioned
      list, as the zapped SP's memory is scheduled for removal via RCU callback.
      The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
      i.e. the below splat is just one possible signature.
      
        ------------[ cut here ]------------
        list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
        WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
        Modules linked in: kvm_intel
        CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #71
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__list_add_valid+0x79/0xa0
        RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
        RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
        R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
        R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
        FS:  00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         track_possible_nx_huge_page+0x53/0x80
         kvm_tdp_mmu_map+0x242/0x2c0
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 61f94478 ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
      Reported-by: default avatarGreg Thelen <gthelen@google.com>
      Analyzed-by: default avatarDavid Matlack <dmatlack@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      21a36ac6
    • Sean Christopherson's avatar
      KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached · 80a3e4ae
      Sean Christopherson authored
      Map the leaf SPTE when handling a TDP MMU page fault if and only if the
      target level is reached.  A recent commit reworked the retry logic and
      incorrectly assumed that walking SPTEs would never "fail", as the loop
      either bails (retries) or installs parent SPs.  However, the iterator
      itself will bail early if it detects a frozen (REMOVED) SPTE when
      stepping down.   The TDP iterator also rereads the current SPTE before
      stepping down specifically to avoid walking into a part of the tree that
      is being removed, which means it's possible to terminate the loop without
      the guts of the loop observing the frozen SPTE, e.g. if a different task
      zaps a parent SPTE between the initial read and try_step_down()'s refresh.
      
      Mapping a leaf SPTE at the wrong level results in all kinds of badness as
      page table walkers interpret the SPTE as a page table, not a leaf, and
      walk into the weeds.
      
        ------------[ cut here ]------------
        WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
        Modules linked in: kvm_intel
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
        RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
        R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
        R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
        Invalid SPTE change: cannot replace a present leaf
        SPTE with another present leaf SPTE mapping a
        different PFN!
        as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
        ------------[ cut here ]------------
        kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
        invalid opcode: 0000 [#1] SMP
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
        RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
        RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
        RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
        R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
        R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_mmu_map+0x3b0/0x510
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        Modules linked in: kvm_intel
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 63d28a25 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
      Cc: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      80a3e4ae
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen · f5d16bb9
      Sean Christopherson authored
      Hoist the is_removed_spte() check above the "level == goal_level" check
      when walking SPTEs during a TDP MMU page fault to avoid attempting to map
      a leaf entry if said entry is frozen by a different task/vCPU.
      
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0
        Modules linked in: kvm_intel
        CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0
        RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246
        RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005
        RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000
        RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005
        R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000
        FS:  00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 63d28a25 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
      Cc: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarRobert Hoo <robert.hu@linux.intel.com>
      Message-Id: <20221213033030.83345-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f5d16bb9
    • Sean Christopherson's avatar
      KVM: nVMX: Don't stuff secondary execution control if it's not supported · a0860d68
      Sean Christopherson authored
      When stuffing the allowed secondary execution controls for nested VMX in
      response to CPUID updates, don't set the allowed-1 bit for a feature that
      isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config.
      
      WARN if KVM attempts to manipulate a feature that isn't supported.  All
      features that are currently stuffed are always advertised to L1 for
      nested VMX if they are supported in KVM's base configuration, and no
      additional features should ever be added to the CPUID-induced stuffing
      (updating VMX MSRs in response to CPUID updates is a long-standing KVM
      flaw that is slowly being fixed).
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221213062306.667649-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a0860d68
    • Sean Christopherson's avatar
      KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1 · 31de69f4
      Sean Christopherson authored
      Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the
      feature is supported in hardware and enabled in KVM's base, non-nested
      configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported.
      This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail
      if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and
      obviously allows L1 to enable the feature for L2.
      
      KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing
      the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when
      updating secondary controls in response to KVM_SET_CPUID(2), but (a) that
      depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID
      updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction
      that the guest value must be a strict subset of the supported host value.
      
      Although no past commit explicitly enabled nested support for WAITPKG,
      doing so is safe and functionally correct from an architectural
      perspective as no additional KVM support is needed to virtualize TPAUSE,
      UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards
      VM-Exits to L1 as necessary (commit bf653b78, "KVM: vmx: Introduce
      handle_unexpected_vmexit and handle WAITPKG vmexit").
      
      Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in
      hardware, i.e. always runs both L1 and L2 with the host's power management
      settings for TPAUSE and UMWAIT.  See commit bf09fb6c ("KVM: VMX: Stop
      context switching MSR_IA32_UMWAIT_CONTROL") for more details.
      
      Fixes: e69e72fa ("KVM: x86: Add support for user wait instructions")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarAaron Lewis <aaronlewis@google.com>
      Reported-by: default avatarYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Message-Id: <20221213062306.667649-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      31de69f4