Commit 3a755ebc authored by Linus Torvalds

Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel TDX support from Borislav Petkov:
 "Intel Trust Domain Extensions (TDX) support.

  This is the Intel version of a confidential computing solution called
  Trust Domain Extensions (TDX). This series adds support to run the
  kernel as part of a TDX guest. It provides similar guest protections
  to AMD's SEV-SNP like guest memory and register state encryption,
  memory integrity protection and a lot more.

  Design-wise, it differs from AMD's solution considerably: it uses a
  software module which runs in a special CPU mode called Secure
  Arbitration Mode (SEAM). As the name suggests, this module serves as
  sort of an arbiter which the confidential guest calls for services it
  needs during its lifetime.

  Just like AMD's SNP set, this series reworks and streamlines certain
  parts of x86 arch code so that this feature can be properly
  accommodated"

* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/tdx: Fix RETs in TDX asm
  x86/tdx: Annotate a noreturn function
  x86/mm: Fix spacing within memory encryption features message
  x86/kaslr: Fix build warning in KASLR code in boot stub
  Documentation/x86: Document TDX kernel architecture
  ACPICA: Avoid cache flush inside virtual machines
  x86/tdx/ioapic: Add shared bit for IOAPIC base address
  x86/mm: Make DMA memory shared for TD guest
  x86/mm/cpa: Add support for TDX shared memory
  x86/tdx: Make pages shared in ioremap()
  x86/topology: Disable CPU online/offline control for TDX guests
  x86/boot: Avoid #VE during boot for TDX platforms
  x86/boot: Set CR0.NE early and keep it set during the boot
  x86/acpi/x86/boot: Add multiprocessor wake-up support
  x86/boot: Add a trampoline for booting APs via firmware handoff
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Port I/O: Add early boot support
  x86/tdx: Port I/O: Add runtime hypercalls
  x86/boot: Port I/O: Add decompression-time support for TDX
  x86/boot: Port I/O: Allow to hook up alternative helpers
  ...
parents 5b828263 c796f021
......@@ -26,6 +26,7 @@ x86-specific Documentation
intel_txt
amd-memory-encryption
amd_hsmp
tdx
pti
mds
microcode
......
.. SPDX-License-Identifier: GPL-2.0
=====================================
Intel Trust Domain Extensions (TDX)
=====================================
Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
guest kernel. Some #VEs are handled entirely inside the guest kernel, but
others require the hypervisor to be consulted.
TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.
New TDX Exceptions
==================
TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.
Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.
Instruction-based #VE
---------------------
- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*,WRMSR*
- CPUID*
Instruction-based #GP
---------------------
- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*,WRMSR*
RDMSR/WRMSR Behavior
--------------------
MSR access behavior falls into three categories:
- #GP generated
- #VE generated
- "Just works"
In general, the #GP MSRs should not be used in guests. Their use likely
indicates a bug in the guest. The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.
The #VE MSRs can typically be handled by the hypervisor. Guests
can make a hypercall to the hypervisor to handle the #VE.
The "just works" MSRs do not need any special guest handling. They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
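When a #VE MSR access is forwarded to the hypervisor, the guest has to translate between the RDMSR/WRMSR register convention (a 64-bit value split across EDX:EAX) and a single 64-bit hypercall argument. The following is a minimal user-space sketch of that arithmetic only. The names `msr_pack`, `msr_lo`, `msr_hi` and `msr_example` are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: combine EDX:EAX into the 64-bit value WRMSR would write. */
uint64_t msr_pack(uint32_t eax, uint32_t edx)
{
	return ((uint64_t)edx << 32) | eax;
}

/* Hypothetical helper: low 32 bits of an MSR value, as RDMSR returns in EAX. */
uint32_t msr_lo(uint64_t val)
{
	return (uint32_t)val;
}

/* Hypothetical helper: high 32 bits of an MSR value, as RDMSR returns in EDX. */
uint32_t msr_hi(uint64_t val)
{
	return (uint32_t)(val >> 32);
}

/* Usage: packing and then splitting a value must round-trip. */
void msr_example(void)
{
	uint64_t val = msr_pack(0x89abcdefu, 0x01234567u);

	assert(msr_lo(val) == 0x89abcdefu);
	assert(msr_hi(val) == 0x01234567u);
}
```

The read_msr() and write_msr() helpers in this series perform the same split and merge on pt_regs before and after the hypercall.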
CPUID Behavior
--------------
For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:
- Bit fields for which the hypervisor controls the value seen by the guest
TD.
- Bit fields for which the hypervisor configures the value such that the
guest TD either sees their native value or a value of 0. For these bit
fields, the hypervisor can mask off the native values, but it can not
turn *on* values.
A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
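The second virtualization type and the guest-side filtering of #VE-generating leaves can be illustrated with a small user-space sketch. It assumes the "mask off but never turn on" rule reduces to a bitwise AND, and it reuses the 0x40000000-0x4FFFFFFF hypervisor range that handle_cpuid() in this series checks. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the second virtualization type: the hypervisor
 * may clear native bits but can never set a bit the CPU does not report.
 */
uint32_t cpuid_mask_native(uint32_t native, uint32_t allow_mask)
{
	return native & allow_mask;
}

/* Leaves reserved for hypervisor communication, the only ones forwarded. */
bool cpuid_leaf_is_hypervisor(uint32_t leaf)
{
	return leaf >= 0x40000000u && leaf <= 0x4FFFFFFFu;
}

/* Usage: native bits 0, 1 and 3 set; the hypervisor allows bits 0 and 2. */
void cpuid_example(void)
{
	/* Bit 2 stays clear because the hardware never reported it. */
	assert(cpuid_mask_native(0xBu, 0x5u) == 0x1u);
	assert(cpuid_leaf_is_hypervisor(0x40000001u));
	assert(!cpuid_leaf_is_hypervisor(0x00000001u));
}
```

For leaves outside the hypervisor range, the guest handler in this series returns all zeros instead of asking the untrusted host.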
#VE on Memory Accesses
======================
There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
against access from the hypervisor. Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.
A TD guest is in control of whether its memory accesses are treated as
private or shared. It selects the behavior with a bit in its page table
entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
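As a rough illustration of the page-table bit, the sketch below follows the convention used by get_cc_mask(), cc_mkenc() and cc_mkdec() in this series: the highest guest physical address bit is the "shared" bit, set for shared pages and clear for private ones. It is a user-space model with hypothetical names, not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Shared-bit mask: the highest bit of a guest physical address. */
uint64_t shared_mask(unsigned int gpa_width)
{
	assert(gpa_width >= 1 && gpa_width <= 64);
	return 1ULL << (gpa_width - 1);
}

/* Hypothetical helper: mark an entry as shared (bit set on Intel TDX). */
uint64_t pte_mkshared(uint64_t pte, uint64_t mask)
{
	return pte | mask;
}

/* Hypothetical helper: mark an entry as private (bit clear on Intel TDX). */
uint64_t pte_mkprivate(uint64_t pte, uint64_t mask)
{
	return pte & ~mask;
}

/* Usage: converting to shared and back restores the original entry. */
void shared_bit_example(void)
{
	uint64_t mask = shared_mask(48);
	uint64_t pte = 0x1000;

	assert(pte_mkprivate(pte_mkshared(pte, mask), mask) == pte);
}
```

Note that the polarity is the opposite of AMD's, where a set bit means the page is encrypted.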
#VE on Shared Memory
--------------------
Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages for which it can safely handle a #VE. For
instance, the guest should be careful not to access shared memory in the
#VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET).
Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks. A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace. Both the hypervisor and
userspace are completely untrusted.
MMIO for virtual devices is implemented as shared memory. The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.
#VE on Private Pages
--------------------
An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses. This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.
A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.
The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE. It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.
Linux #VE handler
=================
Just like page faults or #GPs, #VE exceptions can either be handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.
Handling nested exceptions on x86 is typically nasty business. A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues. The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.
During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked. The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts
or a new #VE can be delivered.
However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
MMIO handling
=============
In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access. That is not possible in TDX guests because a VMEXIT
would expose the register state to the host. TDX guests don't trust the host
and can't have their state exposed to it.
In TDX, MMIO regions typically trigger a #VE exception in the guest. The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.
MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.
MMIO access via other means (like structure overlays) may result in an
oops.
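The "controlled TDCALL" can be pictured as packing only the decoded access into a handful of registers. The sketch below is a user-space model loosely based on mmio_read() and mmio_write() in this series. The struct, the `SKETCH_*` constants and the function names are hypothetical. The value 48 for the EPT-violation exit reason is an assumption taken from the VMX exit reason numbering, and passing 0 in r15 for reads is a simplification.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HYPERCALL_STANDARD        0  /* stands in for TDX_HYPERCALL_STANDARD */
#define SKETCH_EXIT_REASON_EPT_VIOLATION 48 /* assumed EXIT_REASON_EPT_VIOLATION */
#define SKETCH_EPT_READ                  0
#define SKETCH_EPT_WRITE                 1

/* Hypothetical stand-in for struct tdx_hypercall_args. */
struct sketch_hypercall_args {
	uint64_t r10, r11, r12, r13, r14, r15;
};

/*
 * Build the register image for an emulated MMIO access. Only the access
 * size, direction, guest physical address and (for writes) the value are
 * exposed to the host; no other guest register state is included.
 */
struct sketch_hypercall_args mmio_hypercall_args(int size, int is_write,
						 uint64_t gpa, uint64_t val)
{
	struct sketch_hypercall_args args = {
		.r10 = SKETCH_HYPERCALL_STANDARD,
		.r11 = SKETCH_EXIT_REASON_EPT_VIOLATION,
		.r12 = (uint64_t)size,
		.r13 = is_write ? SKETCH_EPT_WRITE : SKETCH_EPT_READ,
		.r14 = gpa,
		.r15 = is_write ? val : 0,
	};

	/* This sketch only models 1, 2, 4 and 8 byte accesses. */
	assert(size == 1 || size == 2 || size == 4 || size == 8);
	return args;
}

/* Usage: a 4-byte write of 0x1234 to guest physical address 0xfee00000. */
void mmio_example(void)
{
	struct sketch_hypercall_args a =
		mmio_hypercall_args(4, 1, 0xfee00000ULL, 0x1234);

	assert(a.r12 == 4 && a.r13 == SKETCH_EPT_WRITE);
	assert(a.r14 == 0xfee00000ULL && a.r15 == 0x1234);
}
```

In the real handler the size, direction and value come from decoding the faulting instruction, which is why only accesses generated by io.h-style helpers are expected to work.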
Shared Memory Conversions
=========================
All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
drivers might have a need to share data with the hypervisor. To do this,
memory must be converted between shared and private. This can be
accomplished using some existing memory encryption helpers:
* set_memory_decrypted() converts a range of pages to shared.
* set_memory_encrypted() converts memory back to private.
Device drivers are the primary user of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.
TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.
For coherent DMA allocation, the DMA buffer gets converted on the
allocation. Check force_dma_unencrypted() for details.
References
==========
TDX reference material is collected here:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
......@@ -878,6 +878,21 @@ config ACRN_GUEST
IOT with small footprint and real-time features. More details can be
found in https://projectacrn.org/.
config INTEL_TDX_GUEST
bool "Intel TDX (Trust Domain Extensions) - Guest Support"
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select X86_MCE
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
TDX includes memory encryption and integrity capabilities
which protect the confidentiality and integrity of guest
memory contents and CPU state. TDX guests are protected from
some attacks from the VMM.
endif #HYPERVISOR_GUEST
source "arch/x86/Kconfig.cpu"
......
......@@ -26,6 +26,7 @@
#include "bitops.h"
#include "ctype.h"
#include "cpuflags.h"
#include "io.h"
/* Useful macros */
#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
......@@ -35,44 +36,10 @@ extern struct boot_params boot_params;
#define cpu_relax() asm volatile("rep; nop")
/* Basic port I/O */
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
static inline u8 inb(u16 port)
{
u8 v;
asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
return v;
}
static inline void outw(u16 v, u16 port)
{
asm volatile("outw %0,%1" : : "a" (v), "dN" (port));
}
static inline u16 inw(u16 port)
{
u16 v;
asm volatile("inw %1,%0" : "=a" (v) : "dN" (port));
return v;
}
static inline void outl(u32 v, u16 port)
{
asm volatile("outl %0,%1" : : "a" (v), "dN" (port));
}
static inline u32 inl(u16 port)
{
u32 v;
asm volatile("inl %1,%0" : "=a" (v) : "dN" (port));
return v;
}
static inline void io_delay(void)
{
const u16 DELAY_PORT = 0x80;
asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
outb(0, DELAY_PORT);
}
/* These functions are used to reference data in other segments. */
......
......@@ -101,6 +101,7 @@ ifdef CONFIG_X86_64
endif
vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
......
......@@ -289,7 +289,7 @@ SYM_FUNC_START(startup_32)
pushl %eax
/* Enter paged protected Mode, activating Long Mode */
movl $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
movl $CR0_STATE, %eax
movl %eax, %cr0
/* Jump from 32bit compatibility mode into 64bit mode. */
......@@ -649,12 +649,28 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
/* Avoid writing EFER if no change was made (for TDX guest) */
jc 1f
wrmsr
popl %edx
1: popl %edx
popl %ecx
#ifdef CONFIG_X86_MCE
/*
* Preserve CR4.MCE if the kernel will enable #MC support.
* Clearing MCE may fault in some environments (that also force #MC
* support). Any machine check that occurs before #MC support is fully
* configured will crash the system regardless of the CR4.MCE value set
* here.
*/
movl %cr4, %eax
andl $X86_CR4_MCE, %eax
#else
movl $0, %eax
#endif
/* Enable PAE and LA57 (if required) paging modes */
movl $X86_CR4_PAE, %eax
orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
......@@ -668,8 +684,9 @@ SYM_CODE_START(trampoline_32bit_src)
pushl $__KERNEL_CS
pushl %eax
/* Enable paging again */
movl $(X86_CR0_PG | X86_CR0_PE), %eax
/* Enable paging again. */
movl %cr0, %eax
btsl $X86_CR0_PG_BIT, %eax
movl %eax, %cr0
lret
......
......@@ -48,6 +48,8 @@ void *memmove(void *dest, const void *src, size_t n);
*/
struct boot_params *boot_params;
struct port_io_ops pio_ops;
memptr free_mem_ptr;
memptr free_mem_end_ptr;
......@@ -374,6 +376,16 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;
init_default_io_ops();
/*
* Detect TDX guest environment.
*
* It has to be done before console_init() in order to use
* paravirtualized port I/O operations if needed.
*/
early_tdx_detect();
console_init();
/*
......
......@@ -22,17 +22,19 @@
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
#include <linux/io.h>
#include <asm/page.h>
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
#include "tdx.h"
#define BOOT_CTYPE_H
#include <linux/acpi.h>
#define BOOT_BOOT_H
#include "../ctype.h"
#include "../io.h"
#include "efi.h"
......
......@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
......
/* SPDX-License-Identifier: GPL-2.0 */
#include "../../coco/tdx/tdcall.S"
// SPDX-License-Identifier: GPL-2.0
#include "../cpuflags.h"
#include "../string.h"
#include "../io.h"
#include "error.h"
#include <vdso/limits.h>
#include <uapi/asm/vmx.h>
#include <asm/shared/tdx.h>
/* Called from __tdx_hypercall() for unrecoverable failure */
void __tdx_hypercall_failed(void)
{
error("TDVMCALL failed. TDX module bug?");
}
static inline unsigned int tdx_io_in(int size, u16 port)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = EXIT_REASON_IO_INSTRUCTION,
.r12 = size,
.r13 = 0,
.r14 = port,
};
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
return UINT_MAX;
return args.r11;
}
static inline void tdx_io_out(int size, u16 port, u32 value)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = EXIT_REASON_IO_INSTRUCTION,
.r12 = size,
.r13 = 1,
.r14 = port,
.r15 = value,
};
__tdx_hypercall(&args, 0);
}
static inline u8 tdx_inb(u16 port)
{
return tdx_io_in(1, port);
}
static inline void tdx_outb(u8 value, u16 port)
{
tdx_io_out(1, port, value);
}
static inline void tdx_outw(u16 value, u16 port)
{
tdx_io_out(2, port, value);
}
void early_tdx_detect(void)
{
u32 eax, sig[3];
cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
if (memcmp(TDX_IDENT, sig, sizeof(sig)))
return;
/* Use hypercalls instead of I/O instructions */
pio_ops.f_inb = tdx_inb;
pio_ops.f_outb = tdx_outb;
pio_ops.f_outw = tdx_outw;
}
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef BOOT_COMPRESSED_TDX_H
#define BOOT_COMPRESSED_TDX_H
#include <linux/types.h>
#ifdef CONFIG_INTEL_TDX_GUEST
void early_tdx_detect(void);
#else
static inline void early_tdx_detect(void) { };
#endif
#endif /* BOOT_COMPRESSED_TDX_H */
......@@ -71,8 +71,7 @@ int has_eflag(unsigned long mask)
# define EBX_REG "=b"
#endif
static inline void cpuid_count(u32 id, u32 count,
u32 *a, u32 *b, u32 *c, u32 *d)
void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
{
asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
"cpuid \n\t"
......
......@@ -17,5 +17,6 @@ extern u32 cpu_vendor[3];
int has_eflag(unsigned long mask);
void get_cpuflags(void);
void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d);
#endif
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef BOOT_IO_H
#define BOOT_IO_H
#include <asm/shared/io.h>
#undef inb
#undef inw
#undef inl
#undef outb
#undef outw
#undef outl
struct port_io_ops {
u8 (*f_inb)(u16 port);
void (*f_outb)(u8 v, u16 port);
void (*f_outw)(u16 v, u16 port);
};
extern struct port_io_ops pio_ops;
/*
* Use the normal I/O instructions by default.
* TDX guests override these to use hypercalls.
*/
static inline void init_default_io_ops(void)
{
pio_ops.f_inb = __inb;
pio_ops.f_outb = __outb;
pio_ops.f_outw = __outw;
}
/*
* Redirect port I/O operations via pio_ops callbacks.
* TDX guests override these callbacks with TDX-specific helpers.
*/
#define inb pio_ops.f_inb
#define outb pio_ops.f_outb
#define outw pio_ops.f_outw
#endif
......@@ -17,6 +17,8 @@
struct boot_params boot_params __attribute__((aligned(16)));
struct port_io_ops pio_ops;
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
......@@ -133,6 +135,8 @@ static void init_heap(void)
void main(void)
{
init_default_io_ops();
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
......
......@@ -4,3 +4,5 @@ KASAN_SANITIZE_core.o := n
CFLAGS_core.o += -fno-stack-protector
obj-y += core.o
obj-$(CONFIG_INTEL_TDX_GUEST) += tdx/
......@@ -18,7 +18,15 @@ static u64 cc_mask __ro_after_init;
static bool intel_cc_platform_has(enum cc_attr attr)
{
return false;
switch (attr) {
case CC_ATTR_GUEST_UNROLL_STRING_IO:
case CC_ATTR_HOTPLUG_DISABLED:
case CC_ATTR_GUEST_MEM_ENCRYPT:
case CC_ATTR_MEM_ENCRYPT:
return true;
default:
return false;
}
}
/*
......@@ -90,9 +98,18 @@ EXPORT_SYMBOL_GPL(cc_platform_has);
u64 cc_mkenc(u64 val)
{
/*
* Both AMD and Intel use a bit in the page table to indicate
* encryption status of the page.
*
* - for AMD, bit *set* means the page is encrypted
* - for Intel *clear* means encrypted.
*/
switch (vendor) {
case CC_VENDOR_AMD:
return val | cc_mask;
case CC_VENDOR_INTEL:
return val & ~cc_mask;
default:
return val;
}
......@@ -100,9 +117,12 @@ u64 cc_mkenc(u64 val)
u64 cc_mkdec(u64 val)
{
/* See comment in cc_mkenc() */
switch (vendor) {
case CC_VENDOR_AMD:
return val & ~cc_mask;
case CC_VENDOR_INTEL:
return val | cc_mask;
default:
return val;
}
......
# SPDX-License-Identifier: GPL-2.0
obj-y += tdx.o tdcall.o
/* SPDX-License-Identifier: GPL-2.0 */
#include <asm/asm-offsets.h>
#include <asm/asm.h>
#include <asm/frame.h>
#include <asm/unwind_hints.h>
#include <linux/linkage.h>
#include <linux/bits.h>
#include <linux/errno.h>
#include "../../virt/vmx/tdx/tdxcall.S"
/*
* Bitmasks of exposed registers (with VMM).
*/
#define TDX_R10 BIT(10)
#define TDX_R11 BIT(11)
#define TDX_R12 BIT(12)
#define TDX_R13 BIT(13)
#define TDX_R14 BIT(14)
#define TDX_R15 BIT(15)
/*
* These registers are clobbered to hold arguments for each
* TDVMCALL. They are safe to expose to the VMM.
* Each bit in this mask represents a register ID. Bit field
* details can be found in TDX GHCI specification, section
* titled "TDCALL [TDG.VP.VMCALL] leaf".
*/
#define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \
TDX_R12 | TDX_R13 | \
TDX_R14 | TDX_R15 )
/*
* __tdx_module_call() - Used by TDX guests to request services from
* the TDX module (does not include VMM services) using TDCALL instruction.
*
* Transforms function call register arguments into the TDCALL register ABI.
* After TDCALL operation, TDX module output is saved in @out (if it is
* provided by the user).
*
*-------------------------------------------------------------------------
* TDCALL ABI:
*-------------------------------------------------------------------------
* Input Registers:
*
* RAX - TDCALL Leaf number.
* RCX,RDX,R8-R9 - TDCALL Leaf specific input registers.
*
* Output Registers:
*
* RAX - TDCALL instruction error code.
* RCX,RDX,R8-R11 - TDCALL Leaf specific output registers.
*
*-------------------------------------------------------------------------
*
* __tdx_module_call() function ABI:
*
* @fn (RDI) - TDCALL Leaf ID, moved to RAX
* @rcx (RSI) - Input parameter 1, moved to RCX
* @rdx (RDX) - Input parameter 2, moved to RDX
* @r8 (RCX) - Input parameter 3, moved to R8
* @r9 (R8) - Input parameter 4, moved to R9
*
* @out (R9) - struct tdx_module_output pointer
* stored temporarily in R12 (not
* shared with the TDX module). It
* can be NULL.
*
* Return status of TDCALL via RAX.
*/
SYM_FUNC_START(__tdx_module_call)
FRAME_BEGIN
TDX_MODULE_CALL host=0
FRAME_END
RET
SYM_FUNC_END(__tdx_module_call)
/*
* __tdx_hypercall() - Make hypercalls to a TDX VMM using TDVMCALL leaf
* of TDCALL instruction
*
* Transforms values in function call argument struct tdx_hypercall_args @args
* into the TDCALL register ABI. After TDCALL operation, VMM output is saved
* back in @args.
*
*-------------------------------------------------------------------------
* TD VMCALL ABI:
*-------------------------------------------------------------------------
*
* Input Registers:
*
* RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
* RCX - BITMAP which controls which part of TD Guest GPR
* is passed as-is to the VMM and back.
* R10 - Set 0 to indicate TDCALL follows standard TDX ABI
* specification. Non zero value indicates vendor
* specific ABI.
* R11 - VMCALL sub function number
* RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
* R8-R9, R12-R15 - Same as above.
*
* Output Registers:
*
* RAX - TDCALL instruction status (Not related to hypercall
* output).
* R10 - Hypercall output error code.
* R11-R15 - Hypercall sub function specific output values.
*
*-------------------------------------------------------------------------
*
* __tdx_hypercall() function ABI:
*
* @args (RDI) - struct tdx_hypercall_args for input and output
* @flags (RSI) - TDX_HCALL_* flags
*
* On successful completion, return the hypercall error code.
*/
SYM_FUNC_START(__tdx_hypercall)
FRAME_BEGIN
/* Save callee-saved GPRs as mandated by the x86_64 ABI */
push %r15
push %r14
push %r13
push %r12
/* Mangle function call ABI into TDCALL ABI: */
/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
xor %eax, %eax
/* Copy hypercall registers from arg struct: */
movq TDX_HYPERCALL_r10(%rdi), %r10
movq TDX_HYPERCALL_r11(%rdi), %r11
movq TDX_HYPERCALL_r12(%rdi), %r12
movq TDX_HYPERCALL_r13(%rdi), %r13
movq TDX_HYPERCALL_r14(%rdi), %r14
movq TDX_HYPERCALL_r15(%rdi), %r15
movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
/*
* For the idle loop STI needs to be called directly before the TDCALL
* that enters idle (EXIT_REASON_HLT case). STI instruction enables
* interrupts only one instruction later. If there is a window between
* STI and the instruction that emulates the HALT state, there is a
* chance for interrupts to happen in this window, which can delay the
* HLT operation indefinitely. Since this is not the desired
* result, conditionally call STI before TDCALL.
*/
testq $TDX_HCALL_ISSUE_STI, %rsi
jz .Lskip_sti
sti
.Lskip_sti:
tdcall
/*
* RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that
* something has gone horribly wrong with the TDX module.
*
* The return status of the hypercall operation is in a separate
* register (in R10). Hypercall errors are a part of normal operation
* and are handled by callers.
*/
testq %rax, %rax
jne .Lpanic
/* TDVMCALL leaf return code is in R10 */
movq %r10, %rax
/* Copy hypercall result registers to arg struct if needed */
testq $TDX_HCALL_HAS_OUTPUT, %rsi
jz .Lout
movq %r10, TDX_HYPERCALL_r10(%rdi)
movq %r11, TDX_HYPERCALL_r11(%rdi)
movq %r12, TDX_HYPERCALL_r12(%rdi)
movq %r13, TDX_HYPERCALL_r13(%rdi)
movq %r14, TDX_HYPERCALL_r14(%rdi)
movq %r15, TDX_HYPERCALL_r15(%rdi)
.Lout:
/*
* Zero out registers exposed to the VMM to avoid speculative execution
* with VMM-controlled values. This needs to include all registers
* present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
* context will be restored.
*/
xor %r10d, %r10d
xor %r11d, %r11d
/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
pop %r12
pop %r13
pop %r14
pop %r15
FRAME_END
RET
.Lpanic:
call __tdx_hypercall_failed
/* __tdx_hypercall_failed never returns */
REACHABLE
jmp .Lpanic
SYM_FUNC_END(__tdx_hypercall)
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2021-2022 Intel Corporation */
#undef pr_fmt
#define pr_fmt(fmt) "tdx: " fmt
#include <linux/cpufeature.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3
#define TDX_ACCEPT_PAGE 6
/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001
/* MMIO direction */
#define EPT_READ 0
#define EPT_WRITE 1
/* Port I/O direction */
#define PORT_READ 0
#define PORT_WRITE 1
/* See Exit Qualification for I/O Instructions in VMX documentation */
#define VE_IS_IO_IN(e) ((e) & BIT(3))
#define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1)
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))
/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
* return code.
*/
static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = fn,
.r12 = r12,
.r13 = r13,
.r14 = r14,
.r15 = r15,
};
return __tdx_hypercall(&args, 0);
}
/* Called from __tdx_hypercall() for unrecoverable failure */
void __tdx_hypercall_failed(void)
{
panic("TDVMCALL failed. TDX module bug?");
}
/*
* The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
* independently from but are currently matched 1:1 with VMX EXIT_REASONs.
* Reusing the KVM EXIT_REASON macros makes it easier to connect the host and
* guest sides of these calls.
*/
static u64 hcall_func(u64 exit_reason)
{
return exit_reason;
}
#ifdef CONFIG_KVM_GUEST
long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
unsigned long p3, unsigned long p4)
{
struct tdx_hypercall_args args = {
.r10 = nr,
.r11 = p1,
.r12 = p2,
.r13 = p3,
.r14 = p4,
};
return __tdx_hypercall(&args, 0);
}
EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
#endif
/*
* Used by TDX guests to make calls directly to the TDX module. This
* should only be used for calls that have no legitimate reason to fail
* or where the kernel can not survive the call failing.
*/
static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_output *out)
{
if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
}
static u64 get_cc_mask(void)
{
struct tdx_module_output out;
unsigned int gpa_width;
/*
* TDINFO TDX module call is used to get the TD execution environment
* information like GPA width, number of available vcpus, debug mode
* information, etc. More details about the ABI can be found in TDX
* Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
* [TDG.VP.INFO].
*
* The GPA width that comes out of this call is critical. TDX guests
* can not meaningfully run without it.
*/
tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
gpa_width = out.rcx & GENMASK(5, 0);
/*
* The highest bit of a guest physical address is the "sharing" bit.
* Set it for shared pages and clear it for private pages.
*/
return BIT_ULL(gpa_width - 1);
}
static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_HLT),
.r12 = irq_disabled,
};
/*
* Emulate HLT operation via hypercall. More info about ABI
* can be found in TDX Guest-Host-Communication Interface
* (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>.
*
* The VMM uses the "IRQ disabled" param to understand IRQ
* enabled status (RFLAGS.IF) of the TD guest and to determine
* whether or not it should schedule the halted vCPU if an
* IRQ becomes pending. E.g. if IRQs are disabled, the VMM
* can keep the vCPU in virtual HLT, even if an IRQ is
* pending, without hanging/breaking the guest.
*/
return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
}
static bool handle_halt(void)
{
/*
* Since non safe halt is mainly used in CPU offlining
* and the guest will always stay in the halt state, don't
* call the STI instruction (set do_sti as false).
*/
const bool irq_disabled = irqs_disabled();
const bool do_sti = false;
if (__halt(irq_disabled, do_sti))
return false;
return true;
}
void __cpuidle tdx_safe_halt(void)
{
/*
* For do_sti=true case, __tdx_hypercall() function enables
* interrupts using the STI instruction before the TDCALL. So
* set irq_disabled as false.
*/
const bool irq_disabled = false;
const bool do_sti = true;
/*
* Use WARN_ONCE() to report the failure.
*/
if (__halt(irq_disabled, do_sti))
WARN_ONCE(1, "HLT instruction emulation failed\n");
}
static bool read_msr(struct pt_regs *regs)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_MSR_READ),
.r12 = regs->cx,
};
/*
* Emulate the MSR read via hypercall. More info about ABI
* can be found in TDX Guest-Host-Communication Interface
* (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
*/
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
return false;
regs->ax = lower_32_bits(args.r11);
regs->dx = upper_32_bits(args.r11);
return true;
}
static bool write_msr(struct pt_regs *regs)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_MSR_WRITE),
.r12 = regs->cx,
.r13 = (u64)regs->dx << 32 | regs->ax,
};
/*
* Emulate the MSR write via hypercall. More info about ABI
* can be found in TDX Guest-Host-Communication Interface
* (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
*/
return !__tdx_hypercall(&args, 0);
}
static bool handle_cpuid(struct pt_regs *regs)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_CPUID),
.r12 = regs->ax,
.r13 = regs->cx,
};
/*
* Only allow VMM to control range reserved for hypervisor
* communication.
*
* Return all-zeros for any CPUID outside the range. It matches CPU
* behaviour for non-supported leaf.
*/
if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
regs->ax = regs->bx = regs->cx = regs->dx = 0;
return true;
}
/*
* Emulate the CPUID instruction via a hypercall. More info about
* ABI can be found in TDX Guest-Host-Communication Interface
* (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
*/
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
return false;
/*
* As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
* EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
* So copy the register contents back to pt_regs.
*/
regs->ax = args.r12;
regs->bx = args.r13;
regs->cx = args.r14;
regs->dx = args.r15;
return true;
}
static bool mmio_read(int size, unsigned long addr, unsigned long *val)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
.r12 = size,
.r13 = EPT_READ,
.r14 = addr,
.r15 = *val,
};
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
return false;
*val = args.r11;
return true;
}
static bool mmio_write(int size, unsigned long addr, unsigned long val)
{
return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
EPT_WRITE, addr, val);
}
static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
char buffer[MAX_INSN_SIZE];
unsigned long *reg, val;
struct insn insn = {};
enum mmio_type mmio;
int size, extend_size;
u8 extend_val = 0;
/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
return false;
if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
return false;
if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
return false;
mmio = insn_decode_mmio(&insn, &size);
if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
return false;
if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
reg = insn_get_modrm_reg_ptr(&insn, regs);
if (!reg)
return false;
}
ve->instr_len = insn.length;
/* Handle writes first */
switch (mmio) {
case MMIO_WRITE:
memcpy(&val, reg, size);
return mmio_write(size, ve->gpa, val);
case MMIO_WRITE_IMM:
val = insn.immediate.value;
return mmio_write(size, ve->gpa, val);
case MMIO_READ:
case MMIO_READ_ZERO_EXTEND:
case MMIO_READ_SIGN_EXTEND:
/* Reads are handled below */
break;
case MMIO_MOVS:
case MMIO_DECODE_FAILED:
/*
* MMIO was accessed with an instruction that could not be
* decoded or handled properly. It was likely not using io.h
* helpers or accessed MMIO accidentally.
*/
return false;
default:
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
return false;
}
/* Handle reads */
if (!mmio_read(size, ve->gpa, &val))
return false;
switch (mmio) {
case MMIO_READ:
/* Zero-extend for 32-bit operation */
extend_size = size == 4 ? sizeof(*reg) : 0;
break;
case MMIO_READ_ZERO_EXTEND:
/* Zero extend based on operand size */
extend_size = insn.opnd_bytes;
break;
case MMIO_READ_SIGN_EXTEND:
/* Sign extend based on operand size */
extend_size = insn.opnd_bytes;
if (size == 1 && val & BIT(7))
extend_val = 0xFF;
else if (size > 1 && val & BIT(15))
extend_val = 0xFF;
break;
default:
/* All other cases have to be covered by the first switch() */
WARN_ON_ONCE(1);
return false;
}
if (extend_size)
memset(reg, extend_val, extend_size);
memcpy(reg, &val, size);
return true;
}
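The final memset()/memcpy() pair in handle_mmio() can be modelled in user space roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, the register is passed by value, and a little-endian host (as on x86) is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of the tail of handle_mmio(): fill the destination
 * register with the extension byte first, then copy the MMIO value over
 * its low bytes. Assumes a little-endian host.
 */
static uint64_t mmio_store_result(uint64_t reg, uint64_t val, int size,
				  int extend_size, uint8_t extend_val)
{
	assert(size >= 1 && size <= 8);
	assert(extend_size >= 0 && extend_size <= 8);

	if (extend_size)
		memset(&reg, extend_val, extend_size);
	memcpy(&reg, &val, size);
	return reg;
}
```

With extend_size == 0 only the low `size` bytes change, which mirrors how 8-bit and 16-bit moves leave the rest of the register untouched.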
static bool handle_in(struct pt_regs *regs, int size, int port)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
.r12 = size,
.r13 = PORT_READ,
.r14 = port,
};
u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
bool success;
/*
* Emulate the I/O read via hypercall. More info about ABI can be found
* in TDX Guest-Host-Communication Interface (GHCI) section titled
* "TDG.VP.VMCALL<Instruction.IO>".
*/
success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT);
/* Update part of the register affected by the emulated instruction */
regs->ax &= ~mask;
if (success)
regs->ax |= args.r11 & mask;
return success;
}
static bool handle_out(struct pt_regs *regs, int size, int port)
{
u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
/*
* Emulate the I/O write via hypercall. More info about ABI can be found
* in TDX Guest-Host-Communication Interface (GHCI) section titled
* "TDG.VP.VMCALL<Instruction.IO>".
*/
return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size,
PORT_WRITE, port, regs->ax & mask);
}
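The partial-register update done for port I/O can be sketched in user space as below. The helper names are hypothetical, and the mask is assumed to cover exactly the low `size` bytes (bits 8*size-1 down to 0), which is what an IN or OUT of that width touches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask covering the low `size` bytes (size is 1, 2 or 4). */
static uint64_t io_mask(int size)
{
	assert(size == 1 || size == 2 || size == 4);
	return (UINT64_C(1) << (8 * size)) - 1;
}

/* Merge an emulated IN result into a saved RAX image, in the style of handle_in(). */
static uint64_t merge_in_result(uint64_t ax, uint64_t result, int size, int success)
{
	uint64_t mask = io_mask(size);

	ax &= ~mask;			/* affected bits are cleared even on failure */
	if (success)
		ax |= result & mask;
	return ax;
}
```

For example, a successful one-byte read changes only the lowest byte of the saved register, while a failed read leaves that byte zeroed.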
/*
* Emulate I/O using hypercall.
*
* Assumes the IO instruction was using ax, which is enforced
* by the standard io.h macros.
*
* Returns true on success, false on failure.
*/
static bool handle_io(struct pt_regs *regs, u32 exit_qual)
{
int size, port;
bool in;
if (VE_IS_IO_STRING(exit_qual))
return false;
in = VE_IS_IO_IN(exit_qual);
size = VE_GET_IO_SIZE(exit_qual);
port = VE_GET_PORT_NUM(exit_qual);
if (in)
return handle_in(regs, size, port);
else
return handle_out(regs, size, port);
}
/*
* Early #VE exception handler. Only handles a subset of port I/O.
* Intended only for earlyprintk. Returns false on failure.
*/
__init bool tdx_early_handle_ve(struct pt_regs *regs)
{
struct ve_info ve;
tdx_get_ve_info(&ve);
if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
return false;
return handle_io(regs, ve.exit_qual);
}
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
/*
* Called during #VE handling to retrieve the #VE info from the
* TDX module.
*
* This has to be called early in #VE handling. A "nested" #VE which
* occurs before this will raise a #DF and is not recoverable.
*
* The call retrieves the #VE info from the TDX module, which also
* clears the "#VE valid" flag. This must be done before anything else
* because any #VE that occurs while the valid flag is set will lead to
* #DF.
*
* Note, the TDX module treats virtual NMIs as inhibited if the #VE
* valid flag is set. It means that NMI=>#VE will not result in a #DF.
*/
tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
/* Transfer the output parameters */
ve->exit_reason = out.rcx;
ve->exit_qual = out.rdx;
ve->gla = out.r8;
ve->gpa = out.r9;
ve->instr_len = lower_32_bits(out.r10);
ve->instr_info = upper_32_bits(out.r10);
}
/* Handle the user initiated #VE */
static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
{
switch (ve->exit_reason) {
case EXIT_REASON_CPUID:
return handle_cpuid(regs);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return false;
}
}
/* Handle the kernel #VE */
static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
{
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
return handle_halt();
case EXIT_REASON_MSR_READ:
return read_msr(regs);
case EXIT_REASON_MSR_WRITE:
return write_msr(regs);
case EXIT_REASON_CPUID:
return handle_cpuid(regs);
case EXIT_REASON_EPT_VIOLATION:
return handle_mmio(regs, ve);
case EXIT_REASON_IO_INSTRUCTION:
return handle_io(regs, ve->exit_qual);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return false;
}
}
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
{
bool ret;
if (user_mode(regs))
ret = virt_exception_user(regs, ve);
else
ret = virt_exception_kernel(regs, ve);
/* After successful #VE handling, move the IP */
if (ret)
regs->ip += ve->instr_len;
return ret;
}
static bool tdx_tlb_flush_required(bool private)
{
/*
* TDX guest is responsible for flushing TLB on private->shared
* transition. VMM is responsible for flushing on shared->private.
*
* The VMM _can't_ flush private addresses as it can't generate PAs
* with the guest's HKID. Shared memory isn't subject to integrity
* checking, i.e. the VMM doesn't need to flush for its own protection.
*
* There's no need to flush when converting from shared to private,
* as flushing is the VMM's responsibility in this case, e.g. it must
* flush to avoid integrity failures in the face of a buggy or
* malicious guest.
*/
return !private;
}
static bool tdx_cache_flush_required(void)
{
/*
* AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
* TDX doesn't have such capability.
*
* Flush cache unconditionally.
*/
return true;
}
static bool try_accept_one(phys_addr_t *start, unsigned long len,
enum pg_level pg_level)
{
unsigned long accept_size = page_level_size(pg_level);
u64 tdcall_rcx;
u8 page_size;
if (!IS_ALIGNED(*start, accept_size))
return false;
if (len < accept_size)
return false;
/*
* Pass the page physical address to the TDX module to accept the
* pending, private page.
*
* Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
*/
switch (pg_level) {
case PG_LEVEL_4K:
page_size = 0;
break;
case PG_LEVEL_2M:
page_size = 1;
break;
case PG_LEVEL_1G:
page_size = 2;
break;
default:
return false;
}
tdcall_rcx = *start | page_size;
if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
return false;
*start += accept_size;
return true;
}
/*
* Inform the VMM of the guest's intent for this physical page: shared with
* the VMM or private to the guest. The VMM is expected to change its mapping
* of the page in response.
*/
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
phys_addr_t start = __pa(vaddr);
phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
if (!enc) {
/* Set the shared (decrypted) bits: */
start |= cc_mkdec(0);
end |= cc_mkdec(0);
}
/*
* Notify the VMM about page mapping conversion. More info about ABI
* can be found in TDX Guest-Host-Communication Interface (GHCI),
* section "TDG.VP.VMCALL<MapGPA>"
*/
if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
return false;
/* private->shared conversion requires only MapGPA call */
if (!enc)
return true;
/*
* For shared->private conversion, accept the page using
* TDX_ACCEPT_PAGE TDX module call.
*/
while (start < end) {
unsigned long len = end - start;
/*
* Try larger accepts first. This gives the VMM a chance to keep
* 1G/2M SEPT entries where possible and, if successful, speeds up
* the process by cutting the number of TDX module calls.
*/
if (try_accept_one(&start, len, PG_LEVEL_1G))
continue;
if (try_accept_one(&start, len, PG_LEVEL_2M))
continue;
if (!try_accept_one(&start, len, PG_LEVEL_4K))
return false;
}
return true;
}
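The largest-first accept loop in tdx_enc_status_changed() can be simulated in user space to see how many accept calls a range needs. This is only a sketch: the names are hypothetical, and the TDX module call is assumed to always succeed for an aligned, fully covered chunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)

/* Model of try_accept_one(): succeeds only for an aligned, fully covered chunk. */
static bool try_accept(uint64_t *start, uint64_t len, uint64_t accept_size)
{
	if (*start & (accept_size - 1))
		return false;
	if (len < accept_size)
		return false;
	*start += accept_size;	/* the real code issues TDX_ACCEPT_PAGE here */
	return true;
}

/* Count the accept calls needed for [start, end), trying the largest size first. */
static unsigned int count_accepts(uint64_t start, uint64_t end)
{
	unsigned int calls = 0;

	assert(!(start & (SZ_4K - 1)) && !(end & (SZ_4K - 1)));
	while (start < end) {
		uint64_t len = end - start;

		if (!try_accept(&start, len, SZ_1G) &&
		    !try_accept(&start, len, SZ_2M) &&
		    !try_accept(&start, len, SZ_4K))
			break;	/* not reachable for 4K-aligned input */
		calls++;
	}
	return calls;
}
```

A 2M-aligned 2M range takes one call, whereas the same length shifted by 4K falls back to 512 separate 4K accepts.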
void __init tdx_early_init(void)
{
u64 cc_mask;
u32 eax, sig[3];
cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
if (memcmp(TDX_IDENT, sig, sizeof(sig)))
return;
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
cc_set_vendor(CC_VENDOR_INTEL);
cc_mask = get_cc_mask();
cc_set_mask(cc_mask);
/*
* All bits above the GPA width are reserved and the kernel treats the
* shared bit as a flag, not as part of the physical address.
*
* Adjust physical mask to only cover valid GPA bits.
*/
physical_mask &= cc_mask - 1;
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
pr_info("Guest detected\n");
}
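The signature check at the top of tdx_early_init() compares 12 bytes assembled from EBX, EDX and ECX, in that order. A user-space sketch follows; the function names are hypothetical, and the signature is assumed to be "IntelTDX" padded with four spaces so that it fills all three registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed 12-byte signature: "IntelTDX" followed by four spaces. */
static const char tdx_ident[12] = {
	'I', 'n', 't', 'e', 'l', 'T', 'D', 'X', ' ', ' ', ' ', ' '
};

/* Lay the registers out as EBX, EDX, ECX and compare with the signature. */
static bool is_tdx_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	uint32_t sig[3] = { ebx, edx, ecx };

	return memcmp(tdx_ident, sig, sizeof(sig)) == 0;
}

/* Hypothetical helper: pack four characters into a register-sized value. */
static uint32_t pack4(const char *s)
{
	uint32_t v;

	assert(strlen(s) >= 4);
	memcpy(&v, s, sizeof(v));
	return v;
}
```

Because the comparison length is sizeof(sig), the reference string has to provide at least 12 meaningful bytes.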
......@@ -13,7 +13,19 @@
/* Asm macros */
#define ACPI_FLUSH_CPU_CACHE() wbinvd()
/*
* ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
* It is required to prevent data loss.
*
* While running inside a virtual machine, the kernel can bypass cache flushing.
* Changing sleep state in a virtual machine doesn't affect the host system
* sleep state and cannot lead to data loss.
*/
#define ACPI_FLUSH_CPU_CACHE() \
do { \
if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) \
wbinvd(); \
} while (0)
int __acpi_acquire_global_lock(unsigned int *lock);
int __acpi_release_global_lock(unsigned int *lock);
......
......@@ -328,6 +328,8 @@ struct apic {
/* wakeup_secondary_cpu */
int (*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
/* wakeup secondary CPU using 64-bit wakeup point */
int (*wakeup_secondary_cpu_64)(int apicid, unsigned long start_eip);
void (*inquire_remote_apic)(int apicid);
......@@ -488,6 +490,11 @@ static inline unsigned int read_apic_id(void)
return apic->get_apic_id(reg);
}
#ifdef CONFIG_X86_64
typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
#endif
extern int default_apic_id_valid(u32 apicid);
extern int default_acpi_madt_oem_check(char *, char *);
extern void default_setup_apic_routing(void);
......
......@@ -238,6 +238,7 @@
#define X86_FEATURE_VMW_VMMCALL ( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
#define X86_FEATURE_PVUNLOCK ( 8*32+20) /* "" PV unlock function */
#define X86_FEATURE_VCPUPREEMPT ( 8*32+21) /* "" PV vcpu_is_preempted function */
#define X86_FEATURE_TDX_GUEST ( 8*32+22) /* Intel Trust Domain Extensions Guest */
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
......
......@@ -68,6 +68,12 @@
# define DISABLE_SGX (1 << (X86_FEATURE_SGX & 31))
#endif
#ifdef CONFIG_INTEL_TDX_GUEST
# define DISABLE_TDX_GUEST 0
#else
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif
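The feature number encodes a word index and a bit position, which is why the disable mask is built with `& 31` and then placed in DISABLED_MASK8. A small sketch of that arithmetic, using hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Same encoding as X86_FEATURE_TDX_GUEST in this diff: word 8, bit 22. */
#define FEATURE_TDX_GUEST (8 * 32 + 22)

/* Index of the 32-bit capability word that holds the feature bit. */
static unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Single-bit mask within that word, in the style of DISABLE_TDX_GUEST. */
static uint32_t feature_mask(unsigned int feature)
{
	assert(feature < 32 * 32);	/* arbitrary bound for this sketch */
	return UINT32_C(1) << (feature & 31);
}
```

Here feature_word() yields 8 and feature_mask() yields 0x00400000, so the mask belongs in the word-8 disabled mask.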
/*
* Make sure to add features to the correct mask
*/
......@@ -79,7 +85,7 @@
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
#define DISABLED_MASK7 (DISABLE_PTI)
#define DISABLED_MASK8 0
#define DISABLED_MASK8 (DISABLE_TDX_GUEST)
#define DISABLED_MASK9 (DISABLE_SMAP|DISABLE_SGX)
#define DISABLED_MASK10 0
#define DISABLED_MASK11 0
......
......@@ -632,6 +632,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif
#ifdef CONFIG_INTEL_TDX_GUEST
DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
#endif
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
......
......@@ -44,6 +44,7 @@
#include <asm/page.h>
#include <asm/early_ioremap.h>
#include <asm/pgtable_types.h>
#include <asm/shared/io.h>
#define build_mmio_read(name, size, type, reg, barrier) \
static inline type name(const volatile void __iomem *addr) \
......@@ -256,37 +257,23 @@ static inline void slow_down_io(void)
#endif
#define BUILDIO(bwl, bw, type) \
static inline void out##bwl(unsigned type value, int port) \
{ \
asm volatile("out" #bwl " %" #bw "0, %w1" \
: : "a"(value), "Nd"(port)); \
} \
\
static inline unsigned type in##bwl(int port) \
{ \
unsigned type value; \
asm volatile("in" #bwl " %w1, %" #bw "0" \
: "=a"(value) : "Nd"(port)); \
return value; \
} \
\
static inline void out##bwl##_p(unsigned type value, int port) \
static inline void out##bwl##_p(type value, u16 port) \
{ \
out##bwl(value, port); \
slow_down_io(); \
} \
\
static inline unsigned type in##bwl##_p(int port) \
static inline type in##bwl##_p(u16 port) \
{ \
unsigned type value = in##bwl(port); \
type value = in##bwl(port); \
slow_down_io(); \
return value; \
} \
\
static inline void outs##bwl(int port, const void *addr, unsigned long count) \
static inline void outs##bwl(u16 port, const void *addr, unsigned long count) \
{ \
if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
unsigned type *value = (unsigned type *)addr; \
type *value = (type *)addr; \
while (count) { \
out##bwl(*value, port); \
value++; \
......@@ -299,10 +286,10 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
} \
} \
\
static inline void ins##bwl(int port, void *addr, unsigned long count) \
static inline void ins##bwl(u16 port, void *addr, unsigned long count) \
{ \
if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
unsigned type *value = (unsigned type *)addr; \
type *value = (type *)addr; \
while (count) { \
*value = in##bwl(port); \
value++; \
......@@ -315,13 +302,11 @@ static inline void ins##bwl(int port, void *addr, unsigned long count) \
} \
}
BUILDIO(b, b, char)
BUILDIO(w, w, short)
BUILDIO(l, , int)
BUILDIO(b, b, u8)
BUILDIO(w, w, u16)
BUILDIO(l, , u32)
#undef BUILDIO
#define inb inb
#define inw inw
#define inl inl
#define inb_p inb_p
#define inw_p inw_p
#define inl_p inl_p
......@@ -329,9 +314,6 @@ BUILDIO(l, , int)
#define insw insw
#define insl insl
#define outb outb
#define outw outw
#define outl outl
#define outb_p outb_p
#define outw_p outw_p
#define outl_p outl_p
......
......@@ -7,6 +7,8 @@
#include <linux/interrupt.h>
#include <uapi/asm/kvm_para.h>
#include <asm/tdx.h>
#ifdef CONFIG_KVM_GUEST
bool kvm_check_and_clear_guest_paused(void);
#else
......@@ -32,6 +34,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
static inline long kvm_hypercall0(unsigned int nr)
{
long ret;
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr)
......@@ -42,6 +48,10 @@ static inline long kvm_hypercall0(unsigned int nr)
static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
long ret;
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1)
......@@ -53,6 +63,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
{
long ret;
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2)
......@@ -64,6 +78,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
{
long ret;
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3)
......@@ -76,6 +94,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
unsigned long p4)
{
long ret;
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
......
......@@ -49,9 +49,6 @@ void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages,
void __init mem_encrypt_free_decrypted_mem(void);
/* Architecture __weak replacement functions */
void __init mem_encrypt_init(void);
void __init sev_es_init_vc_handling(void);
#define __bss_decrypted __section(".bss..decrypted")
......@@ -89,6 +86,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
#endif /* CONFIG_AMD_MEM_ENCRYPT */
/* Architecture __weak replacement functions */
void __init mem_encrypt_init(void);
/*
* The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
* writing to or comparing values from the cr3 register. Having the
......
......@@ -25,6 +25,7 @@ struct real_mode_header {
u32 sev_es_trampoline_start;
#endif
#ifdef CONFIG_X86_64
u32 trampoline_start64;
u32 trampoline_pgd;
#endif
/* ACPI S3 wakeup */
......
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_X86_SHARED_IO_H
#define _ASM_X86_SHARED_IO_H
#include <linux/types.h>
#define BUILDIO(bwl, bw, type) \
static inline void __out##bwl(type value, u16 port) \
{ \
asm volatile("out" #bwl " %" #bw "0, %w1" \
: : "a"(value), "Nd"(port)); \
} \
\
static inline type __in##bwl(u16 port) \
{ \
type value; \
asm volatile("in" #bwl " %w1, %" #bw "0" \
: "=a"(value) : "Nd"(port)); \
return value; \
}
BUILDIO(b, b, u8)
BUILDIO(w, w, u16)
BUILDIO(l, , u32)
#undef BUILDIO
#define inb __inb
#define inw __inw
#define inl __inl
#define outb __outb
#define outw __outw
#define outl __outl
#endif
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_X86_SHARED_TDX_H
#define _ASM_X86_SHARED_TDX_H
#include <linux/bits.h>
#include <linux/types.h>
#define TDX_HYPERCALL_STANDARD 0
#define TDX_HCALL_HAS_OUTPUT BIT(0)
#define TDX_HCALL_ISSUE_STI BIT(1)
#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX    "
#ifndef __ASSEMBLY__
/*
* Used in __tdx_hypercall() to pass down and get back registers' values of
* the TDCALL instruction when requesting services from the VMM.
*
* This is a software only structure and not part of the TDX module/VMM ABI.
*/
struct tdx_hypercall_args {
u64 r10;
u64 r11;
u64 r12;
u64 r13;
u64 r14;
u64 r15;
};
/* Used to request services from the VMM */
u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
/* Called from __tdx_hypercall() for unrecoverable failure */
void __tdx_hypercall_failed(void);
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_SHARED_TDX_H */
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (C) 2021-2022 Intel Corporation */
#ifndef _ASM_X86_TDX_H
#define _ASM_X86_TDX_H
#include <linux/init.h>
#include <linux/bits.h>
#include <asm/ptrace.h>
#include <asm/shared/tdx.h>
/*
* SW-defined error codes.
*
* Bits 47:40 == 0xFF indicate the Reserved status code class, which is never
* used by the TDX module.
*/
#define TDX_ERROR _BITUL(63)
#define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
#ifndef __ASSEMBLY__
/*
* Used to gather the output registers values of the TDCALL and SEAMCALL
* instructions when requesting services from the TDX module.
*
* This is a software only structure and not part of the TDX module/VMM ABI.
*/
struct tdx_module_output {
u64 rcx;
u64 rdx;
u64 r8;
u64 r9;
u64 r10;
u64 r11;
};
/*
* Used by the #VE exception handler to gather the #VE exception
* info from the TDX module. This is a software only structure
* and not part of the TDX module/VMM ABI.
*/
struct ve_info {
u64 exit_reason;
u64 exit_qual;
/* Guest Linear (virtual) Address */
u64 gla;
/* Guest Physical Address */
u64 gpa;
u32 instr_len;
u32 instr_info;
};
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
/* Used to communicate with the TDX module */
u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_output *out);
void tdx_get_ve_info(struct ve_info *ve);
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
void tdx_safe_halt(void);
bool tdx_early_handle_ve(struct pt_regs *regs);
#else
static inline void tdx_early_init(void) { };
static inline void tdx_safe_halt(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
#endif /* CONFIG_INTEL_TDX_GUEST */
#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
unsigned long p3, unsigned long p4);
#else
static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3,
unsigned long p4)
{
return -ENODEV;
}
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
......@@ -65,6 +65,13 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
static bool acpi_support_online_capable;
#endif
#ifdef CONFIG_X86_64
/* Physical address of the Multiprocessor Wakeup Structure mailbox */
static u64 acpi_mp_wake_mailbox_paddr;
/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
#endif
#ifdef CONFIG_X86_IO_APIC
/*
* Locks related to IOAPIC hotplug
......@@ -336,7 +343,60 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
return 0;
}
#endif /*CONFIG_X86_LOCAL_APIC */
#ifdef CONFIG_X86_64
static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
{
/*
* Remap mailbox memory only for the first call to acpi_wakeup_cpu().
*
* Wakeup of secondary CPUs is fully serialized in the core code.
* No need to protect acpi_mp_wake_mailbox from concurrent accesses.
*/
if (!acpi_mp_wake_mailbox) {
acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
sizeof(*acpi_mp_wake_mailbox),
MEMREMAP_WB);
}
/*
* Mailbox memory is shared between the firmware and OS. Firmware will
* listen on mailbox command address, and once it receives the wakeup
* command, the CPU associated with the given apicid will be booted.
*
* The value of 'apic_id' and 'wakeup_vector' must be visible to the
* firmware before the wakeup command is visible. smp_store_release()
* ensures ordering and visibility.
*/
acpi_mp_wake_mailbox->apic_id = apicid;
acpi_mp_wake_mailbox->wakeup_vector = start_ip;
smp_store_release(&acpi_mp_wake_mailbox->command,
ACPI_MP_WAKE_COMMAND_WAKEUP);
/*
* Wait for the CPU to wake up.
*
* The CPU being woken up is essentially in a spin loop waiting to be
* woken up. It should not take long for it to wake up and acknowledge
* by zeroing out ->command.
*
* The ACPI specification doesn't provide any guidance on how long the
* kernel has to wait for a wake up acknowledgement. It also doesn't
* provide a way to cancel a wake up request if it takes too long.
*
* In a TDX environment, the VMM has control over how long it takes to
* wake up a secondary CPU. It can postpone scheduling the secondary
* vCPU indefinitely. Giving up on the wake up request and reporting an
* error would open a possible attack vector for the VMM: it could wake
* up the secondary CPU when the kernel doesn't expect it. So wait
* until the wake up request is acknowledged.
*/
while (READ_ONCE(acpi_mp_wake_mailbox->command))
cpu_relax();
return 0;
}
#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */
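The mailbox handshake in acpi_wakeup_cpu() depends on the payload being visible before the command. A single-threaded sketch with C11 atomics is given below; the struct layout, the function names and the command value are assumptions made for illustration, not the ACPI-defined mailbox format.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MP_WAKE_COMMAND_WAKEUP 1	/* assumed command value for this sketch */

struct wake_mailbox {
	_Atomic uint16_t command;
	uint32_t apic_id;
	uint64_t wakeup_vector;
};

/* OS side: publish the payload first, then the command with release ordering. */
static void os_post_wakeup(struct wake_mailbox *mb, uint32_t apic_id, uint64_t ip)
{
	assert(atomic_load_explicit(&mb->command, memory_order_relaxed) == 0);
	mb->apic_id = apic_id;
	mb->wakeup_vector = ip;
	atomic_store_explicit(&mb->command, MP_WAKE_COMMAND_WAKEUP,
			      memory_order_release);
}

/* Firmware side: an acquire load of the command makes the payload visible. */
static bool fw_poll(struct wake_mailbox *mb, uint32_t my_apic_id, uint64_t *ip)
{
	if (atomic_load_explicit(&mb->command, memory_order_acquire) !=
	    MP_WAKE_COMMAND_WAKEUP)
		return false;
	if (mb->apic_id != my_apic_id)
		return false;
	*ip = mb->wakeup_vector;
	/* Acknowledge by zeroing the command, which ends the OS-side wait. */
	atomic_store_explicit(&mb->command, 0, memory_order_release);
	return true;
}
```

The kernel code uses smp_store_release() and READ_ONCE() for the same purpose; the C11 orderings here are the closest portable analogue, not an exact equivalent.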
#ifdef CONFIG_X86_IO_APIC
#define MP_ISA_BUS 0
......@@ -1083,6 +1143,29 @@ static int __init acpi_parse_madt_lapic_entries(void)
}
return 0;
}
#ifdef CONFIG_X86_64
static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_madt_multiproc_wakeup *mp_wake;
if (!IS_ENABLED(CONFIG_SMP))
return -ENODEV;
mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
if (BAD_MADT_ENTRY(mp_wake, end))
return -EINVAL;
acpi_table_print_madt_entry(&header->common);
acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
return 0;
}
#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */
#ifdef CONFIG_X86_IO_APIC
......@@ -1278,6 +1361,14 @@ static void __init acpi_process_madt(void)
smp_found_config = 1;
}
#ifdef CONFIG_X86_64
/*
* Parse MADT MP Wake entry.
*/
acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
acpi_parse_mp_wake, 1);
#endif
}
if (error == -EINVAL) {
/*
......
......@@ -2551,6 +2551,16 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
}
EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);
#ifdef CONFIG_X86_64
void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
{
struct apic **drv;
for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
(*drv)->wakeup_secondary_cpu_64 = handler;
}
#endif
/*
* Override the generic EOI implementation with an optimized version.
* Only called during early boot when only one CPU is active and with
......
......@@ -65,6 +65,7 @@
#include <asm/irq_remapping.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
#include <asm/pgtable.h>
#define for_each_ioapic(idx) \
for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
......@@ -2677,6 +2678,19 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}
static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
{
pgprot_t flags = FIXMAP_PAGE_NOCACHE;
/*
* Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
* bits, just like normal ioremap():
*/
flags = pgprot_decrypted(flags);
__set_fixmap(idx, phys, flags);
}
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
......@@ -2709,7 +2723,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
set_fixmap_nocache(idx, ioapic_phys);
io_apic_set_fixmap(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
......@@ -2838,7 +2852,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;
set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
io_apic_set_fixmap(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
......
......@@ -18,6 +18,7 @@
#include <asm/bootparam.h>
#include <asm/suspend.h>
#include <asm/tlbflush.h>
#include <asm/tdx.h>
#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
......@@ -65,6 +66,22 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif
BLANK();
OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
BLANK();
OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
......
......@@ -40,6 +40,7 @@
#include <asm/extable.h>
#include <asm/trapnr.h>
#include <asm/sev.h>
#include <asm/tdx.h>
/*
* Manage page tables very early on.
......@@ -417,6 +418,9 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
return;
if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
return;
early_fixup_exception(regs, trapnr);
}
......@@ -515,6 +519,9 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
idt_setup_early_handler();
/* Needed before cc_platform_has() can be used for TDX */
tdx_early_init();
copy_bootdata(__va(real_mode_data));
/*
......
......@@ -173,8 +173,22 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
addq $(init_top_pgt - __START_KERNEL_map), %rax
1:
#ifdef CONFIG_X86_MCE
/*
* Preserve CR4.MCE if the kernel will enable #MC support.
* Clearing MCE may fault in some environments (that also force #MC
* support). Any machine check that occurs before #MC support is fully
* configured will crash the system regardless of the CR4.MCE value set
* here.
*/
movq %cr4, %rcx
andl $X86_CR4_MCE, %ecx
#else
movl $0, %ecx
#endif
/* Enable PAE mode, PGE and LA57 */
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
......@@ -280,13 +294,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
/*
* Preserve current value of EFER for comparison and to skip
* EFER writes if no change was made (for TDX guest)
*/
movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
1: wrmsr /* Make changes effective */
/* Avoid writing EFER if no change was made (for TDX guest) */
1: cmpl %edx, %eax
je 1f
xor %edx, %edx
wrmsr /* Make changes effective */
1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
......
......@@ -69,6 +69,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
#ifdef CONFIG_INTEL_TDX_GUEST
INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
#endif
};
/*
......
......@@ -46,6 +46,7 @@
#include <asm/proto.h>
#include <asm/frame.h>
#include <asm/unwind.h>
#include <asm/tdx.h>
#include "process.h"
......@@ -873,6 +874,9 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
} else if (prefer_mwait_c1_over_halt(c)) {
pr_info("using mwait in idle threads\n");
x86_idle = mwait_idle;
} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
pr_info("using TDX aware idle routine\n");
x86_idle = tdx_safe_halt;
} else
x86_idle = default_idle;
}
......
......@@ -1083,6 +1083,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
unsigned long boot_error = 0;
unsigned long timeout;
#ifdef CONFIG_X86_64
/* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
if (apic->wakeup_secondary_cpu_64)
start_ip = real_mode_header->trampoline_start64;
#endif
idle->thread.sp = (unsigned long)task_pt_regs(idle);
early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
initial_code = (unsigned long)start_secondary;
......@@ -1124,11 +1129,14 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
/*
* Wake up a CPU in different cases:
* - Use the method in the APIC driver if it's defined
* - Use a method from the APIC driver if one is defined, with wakeup
* straight to 64-bit mode preferred over wakeup to RM.
* Otherwise,
* - Use an INIT boot APIC message for APs or NMI for BSP.
*/
if (apic->wakeup_secondary_cpu)
if (apic->wakeup_secondary_cpu_64)
boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
else if (apic->wakeup_secondary_cpu)
boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
else
boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
......
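The wakeup preference introduced in do_boot_cpu() can be sketched in plain C as follows. This is only an illustrative user-space model: `struct apic_model`, `enum wake_method`, `pick_wake_method()` and `dummy_wakeup()` are hypothetical names that do not exist in the kernel; only the two hook names are taken from the hunk.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct apic; only the two
 * wakeup hooks referenced in the hunk are modeled. */
struct apic_model {
	int (*wakeup_secondary_cpu)(int apicid, unsigned long start_ip);
	int (*wakeup_secondary_cpu_64)(int apicid, unsigned long start_ip);
};

/* Hypothetical labels for the three wakeup paths. */
enum wake_method { WAKE_VIA_64BIT, WAKE_VIA_APIC, WAKE_VIA_INIT };

/* Mirror the if/else chain: the 64-bit hook is preferred, then the
 * real-mode hook, and INIT/NMI is the fallback when neither is set. */
static enum wake_method pick_wake_method(const struct apic_model *apic)
{
	if (apic->wakeup_secondary_cpu_64)
		return WAKE_VIA_64BIT;
	if (apic->wakeup_secondary_cpu)
		return WAKE_VIA_APIC;
	return WAKE_VIA_INIT;
}

/* Dummy hook, used only to populate the function pointers. */
static int dummy_wakeup(int apicid, unsigned long start_ip)
{
	(void)apicid;
	(void)start_ip;
	return 0;
}
```

A driver that sets both hooks therefore always takes the 64-bit path, which matches the earlier hunk that switches start_ip to trampoline_start64 under the same condition.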
......@@ -62,6 +62,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
#include <asm/tdx.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
......@@ -686,13 +687,40 @@ static bool try_fixup_enqcmd_gp(void)
#endif
}
static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
unsigned long error_code, const char *str)
{
if (fixup_exception(regs, trapnr, error_code, 0))
return true;
current->thread.error_code = error_code;
current->thread.trap_nr = trapnr;
/*
* To be potentially processing a kprobe fault and to trust the result
* from kprobe_running(), we have to be non-preemptible.
*/
if (!preemptible() && kprobe_running() &&
kprobe_fault_handler(regs, trapnr))
return true;
return notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV) == NOTIFY_STOP;
}
static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
unsigned long error_code, const char *str)
{
current->thread.error_code = error_code;
current->thread.trap_nr = trapnr;
show_signal(current, SIGSEGV, "", str, regs, error_code);
force_sig(SIGSEGV);
}
DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
{
char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
enum kernel_gp_hint hint = GP_NO_HINT;
struct task_struct *tsk;
unsigned long gp_addr;
int ret;
if (user_mode(regs) && try_fixup_enqcmd_gp())
return;
......@@ -711,40 +739,18 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
return;
}
tsk = current;
if (user_mode(regs)) {
if (fixup_iopl_exception(regs))
goto exit;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;
if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
goto exit;
show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
force_sig(SIGSEGV);
gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
goto exit;
}
if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
goto exit;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;
/*
* To be potentially processing a kprobe fault and to trust the result
* from kprobe_running(), we have to be non-preemptible.
*/
if (!preemptible() &&
kprobe_running() &&
kprobe_fault_handler(regs, X86_TRAP_GP))
goto exit;
ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
if (ret == NOTIFY_STOP)
if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
goto exit;
if (error_code)
......@@ -1343,6 +1349,91 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}
#ifdef CONFIG_INTEL_TDX_GUEST
#define VE_FAULT_STR "VE fault"
static void ve_raise_fault(struct pt_regs *regs, long error_code)
{
if (user_mode(regs)) {
gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
return;
}
if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
return;
die_addr(VE_FAULT_STR, regs, error_code, 0);
}
/*
* Virtualization Exceptions (#VE) are delivered to TDX guests due to
* specific guest actions which may happen in either user space or the
* kernel:
*
* * Specific instructions (WBINVD, for example)
* * Specific MSR accesses
* * Specific CPUID leaf accesses
* * Access to specific guest physical addresses
*
* In the settings that Linux will run in, virtualization exceptions are
* never generated on accesses to normal, TD-private memory that has been
* accepted (by BIOS or with tdx_enc_status_changed()).
*
* Syscall entry code has a critical window where the kernel stack is not
* yet set up. Any exception in this window leads to hard to debug issues
* and can be exploited for privilege escalation. Exceptions in the NMI
* entry code also cause issues. Returning from the exception handler with
* IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
*
* For these reasons, the kernel avoids #VEs during the syscall gap and
* the NMI entry code. Entry code paths do not access TD-shared memory,
* MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
* that might generate #VE. VMM can remove memory from TD at any point,
* but access to unaccepted (or missing) private memory leads to VM
* termination, not to #VE.
*
* Similarly to page faults and breakpoints, #VEs are allowed in NMI
* handlers once the kernel is ready to deal with nested NMIs.
*
* During #VE delivery, all interrupts, including NMIs, are blocked until
* TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
* the VE info.
*
* If a guest kernel action which would normally cause a #VE occurs in
* the interrupt-disabled region before TDGETVEINFO, a #DF (fault
* exception) is delivered to the guest which will result in an oops.
*
* The entry code has been audited carefully for following these expectations.
* Changes in the entry code have to be audited for correctness vs. this
* aspect. Similarly to #PF, #VE in these places will expose kernel to
* privilege escalation or may lead to random crashes.
*/
DEFINE_IDTENTRY(exc_virtualization_exception)
{
struct ve_info ve;
/*
* NMIs/Machine-checks/Interrupts will be in a disabled state
* till TDGETVEINFO TDCALL is executed. This ensures that VE
* info cannot be overwritten by a nested #VE.
*/
tdx_get_ve_info(&ve);
cond_local_irq_enable(regs);
/*
* If tdx_handle_virt_exception() could not process
* it successfully, treat it as #GP(0) and handle it.
*/
if (!tdx_handle_virt_exception(regs, &ve))
ve_raise_fault(regs, 0);
cond_local_irq_disable(regs);
}
#endif
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
......
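The decision flow of exc_virtualization_exception() and ve_raise_fault() in the traps.c hunk can be summarized with a small user-space model. This is a sketch under assumptions, not kernel code: `enum ve_outcome` and `ve_dispatch()` are hypothetical, and the three boolean inputs stand in for the results of tdx_handle_virt_exception(), user_mode() and gp_try_fixup_and_notify().

```c
#include <stdbool.h>

/* Hypothetical outcome labels for the four ways a #VE can end. */
enum ve_outcome { VE_HANDLED, VE_SIGSEGV, VE_FIXED_UP, VE_DIE };

/*
 * Model of the handler's control flow:
 *  - handled_by_tdx: tdx_handle_virt_exception() succeeded
 *  - user_mode:      the exception came from user space
 *  - fixup_or_stop:  gp_try_fixup_and_notify() returned true
 */
static enum ve_outcome ve_dispatch(bool handled_by_tdx, bool user_mode,
				   bool fixup_or_stop)
{
	if (handled_by_tdx)
		return VE_HANDLED;	/* nothing more to do */
	if (user_mode)
		return VE_SIGSEGV;	/* gp_user_force_sig_segv() path */
	if (fixup_or_stop)
		return VE_FIXED_UP;	/* exception table, kprobe or notifier */
	return VE_DIE;			/* die_addr() path */
}
```

The model leaves out the interrupt handling around the call (tdx_get_ve_info() before cond_local_irq_enable()), which the source comment describes as the step that prevents #VE nesting.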
......@@ -11,7 +11,7 @@
#include <asm/msr.h>
#include <asm/archrandom.h>
#include <asm/e820/api.h>
#include <asm/io.h>
#include <asm/shared/io.h>
/*
* When built for the regular kernel, several functions need to be stubbed out
......
......@@ -242,10 +242,15 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
* If the page being mapped is in memory and SEV is active then
* make sure the memory encryption attribute is enabled in the
* resulting mapping.
* In TDX guests, memory is marked private by default. If encryption
* is not requested (using encrypted), explicitly set decrypt
* attribute in all IOREMAPPED memory.
*/
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
else
prot = pgprot_decrypted(prot);
switch (pcm) {
case _PAGE_CACHE_MODE_UC:
......
......@@ -42,7 +42,14 @@ bool force_dma_unencrypted(struct device *dev)
static void print_mem_encrypt_feature_info(void)
{
pr_info("AMD Memory Encryption Features active:");
pr_info("Memory Encryption Features active:");
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
pr_cont(" Intel TDX\n");
return;
}
pr_cont(" AMD");
/* Secure Memory Encryption */
if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
......
......@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
.long pa_sev_es_trampoline_start
#endif
#ifdef CONFIG_X86_64
.long pa_trampoline_start64
.long pa_trampoline_pgd;
#endif
/* ACPI S3 wakeup */
......
......@@ -70,7 +70,7 @@ SYM_CODE_START(trampoline_start)
movw $__KERNEL_DS, %dx # Data segment descriptor
# Enable protected mode
movl $X86_CR0_PE, %eax # protected mode (PE) bit
movl $(CR0_STATE & ~X86_CR0_PG), %eax
movl %eax, %cr0 # into protected mode
# flush prefetch and jump to startup_32
......@@ -143,13 +143,24 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3
# Set up EFER
movl $MSR_EFER, %ecx
rdmsr
/*
* Skip writing to EFER if the register already has desired
* value (to avoid #VE for the TDX guest).
*/
cmp pa_tr_efer, %eax
jne .Lwrite_efer
cmp pa_tr_efer + 4, %edx
je .Ldone_efer
.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
movl $MSR_EFER, %ecx
wrmsr
# Enable paging and in turn activate Long Mode
movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
.Ldone_efer:
# Enable paging and in turn activate Long Mode.
movl $CR0_STATE, %eax
movl %eax, %cr0
/*
......@@ -161,6 +172,19 @@ SYM_CODE_START(startup_32)
ljmpl $__KERNEL_CS, $pa_startup_64
SYM_CODE_END(startup_32)
SYM_CODE_START(pa_trampoline_compat)
/*
* In compatibility mode. Prep ESP and DX for startup_32, then disable
* paging and complete the switch to legacy 32-bit mode.
*/
movl $rm_stack_end, %esp
movw $__KERNEL_DS, %dx
movl $(CR0_STATE & ~X86_CR0_PG), %eax
movl %eax, %cr0
ljmpl $__KERNEL32_CS, $pa_startup_32
SYM_CODE_END(pa_trampoline_compat)
.section ".text64","ax"
.code64
.balign 4
......@@ -169,6 +193,20 @@ SYM_CODE_START(startup_64)
jmpq *tr_start(%rip)
SYM_CODE_END(startup_64)
SYM_CODE_START(trampoline_start64)
/*
* APs start here on a direct transfer from 64-bit BIOS with identity
* mapped page tables. Load the kernel's GDT in order to gear down to
* 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
* segment registers. Load the zero IDT so any fault triggers a
* shutdown instead of jumping back into BIOS.
*/
lidt tr_idt(%rip)
lgdt tr_gdt64(%rip)
ljmpl *tr_compat(%rip)
SYM_CODE_END(trampoline_start64)
.section ".rodata","a"
# Duplicate the global descriptor table
# so the kernel can live anywhere
......@@ -182,6 +220,17 @@ SYM_DATA_START(tr_gdt)
.quad 0x00cf93000000ffff # __KERNEL_DS
SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
SYM_DATA_START(tr_gdt64)
.short tr_gdt_end - tr_gdt - 1 # gdt limit
.long pa_tr_gdt
.long 0
SYM_DATA_END(tr_gdt64)
SYM_DATA_START(tr_compat)
.long pa_trampoline_compat
.short __KERNEL32_CS
SYM_DATA_END(tr_compat)
.bss
.balign PAGE_SIZE
SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
......
/* SPDX-License-Identifier: GPL-2.0 */
.section ".rodata","a"
.balign 16
SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
/*
* When a bootloader hands off to the kernel in 32-bit mode an
* IDT with a 2-byte limit and 4-byte base is needed. When a boot
* loader hands off to a kernel 64-bit mode the base address
* extends to 8-bytes. Reserve enough space for either scenario.
*/
SYM_DATA_START_LOCAL(tr_idt)
.short 0
.quad 0
SYM_DATA_END(tr_idt)
......@@ -62,8 +62,12 @@ static void send_morse(const char *pattern)
}
}
struct port_io_ops pio_ops;
void main(void)
{
init_default_io_ops();
/* Kill machine if structures are wrong */
if (wakeup_header.real_magic != 0x12345678)
while (1)
......
/* SPDX-License-Identifier: GPL-2.0 */
#include <asm/asm-offsets.h>
#include <asm/tdx.h>
/*
* TDCALL and SEAMCALL are supported in Binutils >= 2.36.
*/
#define tdcall .byte 0x66,0x0f,0x01,0xcc
#define seamcall .byte 0x66,0x0f,0x01,0xcf
/*
* TDX_MODULE_CALL - common helper macro for both
* TDCALL and SEAMCALL instructions.
*
* TDCALL - used by TDX guests to make requests to the
* TDX module and hypercalls to the VMM.
* SEAMCALL - used by TDX hosts to make requests to the
* TDX module.
*/
.macro TDX_MODULE_CALL host:req
/*
* R12 will be used as temporary storage for struct tdx_module_output
* pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
* services supported by this function, it can be reused.
*/
/* Callee saved, so preserve it */
push %r12
/*
* Push output pointer to stack.
* After the operation, it will be fetched into R12 register.
*/
push %r9
/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
/* Move Leaf ID to RAX */
mov %rdi, %rax
/* Move input 4 to R9 */
mov %r8, %r9
/* Move input 3 to R8 */
mov %rcx, %r8
/* Move input 1 to RCX */
mov %rsi, %rcx
/* Leave input param 2 in RDX */
.if \host
seamcall
/*
* SEAMCALL instruction is essentially a VMExit from VMX root
* mode to SEAM VMX root mode. VMfailInvalid (CF=1) indicates
* that the targeted SEAM firmware is not loaded or disabled,
* or P-SEAMLDR is busy with another SEAMCALL. %rax is not
* changed in this case.
*
* Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
* This value will never be used as actual SEAMCALL error code as
* it is from the Reserved status code class.
*/
jnc .Lno_vmfailinvalid
mov $TDX_SEAMCALL_VMFAILINVALID, %rax
.Lno_vmfailinvalid:
.else
tdcall
.endif
/*
* Fetch output pointer from stack to R12 (It is used
* as temporary storage)
*/
pop %r12
/*
* Since this macro can be invoked with NULL as an output pointer,
* check if caller provided an output struct before storing output
* registers.
*
* Update output registers, even if the call failed (RAX != 0).
* Other registers may contain details of the failure.
*/
test %r12, %r12
jz .Lno_output_struct
/* Copy result registers to output struct: */
movq %rcx, TDX_MODULE_rcx(%r12)
movq %rdx, TDX_MODULE_rdx(%r12)
movq %r8, TDX_MODULE_r8(%r12)
movq %r9, TDX_MODULE_r9(%r12)
movq %r10, TDX_MODULE_r10(%r12)
movq %r11, TDX_MODULE_r11(%r12)
.Lno_output_struct:
/* Restore the state of R12 register */
pop %r12
.endm
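The register shuffle at the top of TDX_MODULE_CALL can be modeled in C to make the ordering easier to follow. This is a hedged sketch: `struct call_args`, `struct module_regs` and `mangle_abi()` are hypothetical names used only for illustration, and the model covers just the argument moves, not the TDCALL/SEAMCALL instruction or the copy of results into the output struct.

```c
#include <stdint.h>

/* Arguments as the macro receives them, in function-call ABI order. */
struct call_args {
	uint64_t rdi;	/* leaf ID */
	uint64_t rsi;	/* input 1 */
	uint64_t rdx;	/* input 2 */
	uint64_t rcx;	/* input 3 */
	uint64_t r8;	/* input 4 */
	uint64_t r9;	/* pointer to the output struct (may be 0) */
};

/* Registers as laid out for the TDCALL/SEAMCALL instruction. */
struct module_regs {
	uint64_t rax, rcx, rdx, r8, r9;
};

/*
 * Apply the same moves as the macro. The output pointer is parked
 * first (the macro pushes %r9 on the stack) because R9 is about to
 * be overwritten with input 4.
 */
static struct module_regs mangle_abi(const struct call_args *in,
				     uint64_t *saved_out_ptr)
{
	struct module_regs r;

	*saved_out_ptr = in->r9;	/* push %r9 */
	r.rax = in->rdi;		/* leaf ID to RAX */
	r.r9  = in->r8;			/* input 4 to R9 */
	r.r8  = in->rcx;		/* input 3 to R8 */
	r.rcx = in->rsi;		/* input 1 to RCX */
	r.rdx = in->rdx;		/* input 2 stays in RDX */
	return r;
}
```

The order of the moves matters in the assembly: R9 must be saved before it is overwritten, and RCX must be copied to R8 before RSI is moved into RCX.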
......@@ -80,6 +80,16 @@ enum cc_attr {
* using AMD SEV-SNP features.
*/
CC_ATTR_GUEST_SEV_SNP,
/**
* @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
*
* The platform/OS is running as a guest/virtual machine does not
* support CPU hotplug feature.
*
* Examples include TDX Guest.
*/
CC_ATTR_HOTPLUG_DISABLED,
};
#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
......
......@@ -35,6 +35,7 @@
#include <linux/percpu-rwsem.h>
#include <linux/cpuset.h>
#include <linux/random.h>
#include <linux/cc_platform.h>
#include <trace/events/power.h>
#define CREATE_TRACE_POINTS
......@@ -1190,6 +1191,12 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
{
/*
* If the platform does not support hotplug, report it explicitly to
* differentiate it from a transient offlining failure.
*/
if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
return _cpu_down(cpu, 0, target);
......
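The effect of the new check in cpu_down_maps_locked() is that a platform-level "hotplug not supported" condition is reported before, and distinctly from, a transient "hotplug disabled" state. A minimal user-space sketch of that ordering, assuming a hypothetical helper `cpu_down_check()` whose two parameters stand in for cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) and cpu_hotplug_disabled:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the checks at the top of cpu_down_maps_locked().
 * Returns 0 where the kernel would go on to call _cpu_down().
 */
static int cpu_down_check(bool platform_hotplug_disabled,
			  int cpu_hotplug_disabled)
{
	/* Permanent condition: report it explicitly. */
	if (platform_hotplug_disabled)
		return -EOPNOTSUPP;

	/* Transient condition: the caller may retry later. */
	if (cpu_hotplug_disabled)
		return -EBUSY;

	return 0;
}
```

With both conditions set, the caller sees -EOPNOTSUPP rather than -EBUSY, so it can tell that retrying will not help.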