From: Alexander Bluhm <bluhm@openbsd.org>
Subject: Re: SEV-ES guest: locore #VC trap handling
To: Mike Larkin <mlarkin@nested.page>
Cc: tech@openbsd.org
Date: Fri, 20 Jun 2025 21:06:37 +0200

On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> On Sun, Jun 15, 2025 at 04:55:12PM +0200, Alexander Bluhm wrote:
> > On Sat, Jun 14, 2025 at 10:23:53PM -0700, Mike Larkin wrote:
> > > On Wed, May 21, 2025 at 05:10:27PM +0200, Hans-Jörg Höxer wrote:
> > > > Hi,
> > > >
> > > > this change deals with locore for SEV-ES enabled guests.  The approach
> > > > might be a bit controversial.  And it requires a diff for vmd(8), that
> > > > I've also attached, to simplify the discussion:
> > > >
> > > >     SEV-ES guest: locore #VC trap handling
> > > >
> > > >     When locore is executed by a SEV-ES enabled guest the first cpuid
> > > >     instruction will raise a #VC trap that will need to be handled.
> > > >     However, at that point in time the guest does not know whether it's
> > > >     a guest at all, if it is running on an AMD CPU with SEV-ES enabled,
> > > >     etc.
> > > >
> > > >     To resolve this chicken-egg situation we unconditionally set up a
> > > >     #VC trap handler.  If we are not a SEV-ES enabled guest, or we are
> > > >     running on some non-AMD CPU, we will not raise #VC (hopefully).  On
> > > >     Intel CPUs the vector for #VC is reserved.
> > > >
> > > >     As vmd(8) configures the runtime for locore to be in 32 bit
> > > >     compatibility mode, a raised #VC exception will switch to long
> > > >     mode, and the CPU will expect a 64 bit entry in the IDT.  When
> > > >     running on e.g. KVM, locore is executed in 32 bit legacy mode.
> > > >     There the CPU will expect a 32 bit entry in the IDT.
> > > >
> > > >     To accommodate both situations, we set up both 64 and 32 bit handlers
> > > >     in the IDT.
> > > >
> > > >     Additionally, vmd(8) has to set up a long mode segment in the GDT.
> > > >
> > > >     Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> > > >     to emulate CPUID.  The MSR protocol only supports "simple" CPUID
> > > >     without subfunctions.
> > > >
> > > >     Note:  When SEV-ES is enabled, the hypervisor cannot intercept
> > > >     writes to EFER beforehand, only after the write.  Thus on vmm(4)
> > > >     with directly executed kernel we are in compatibility mode and
> > > >     EFER_LMA is set.  As resetting EFER_LMA raises #GP we have to
> > > >     preserve it.
> > > >
> > > > Take care,
> > > > HJ.
> > > >
> > >
> > > This one probably needs a lot of testing. What tests have you guys done so
> > > far? My SEV-ES machine is available now, I will test this and let you know
> > > but we probably want a lot of testing from different host/guest combos.
> >
> > I know that this part does not break SEV (without ES) on vmm/vmd
> > and kvm/qemu.
> >
> > Below is the complete diff that hshoexer@ sent some months ago,
> > rebased to current.  With that I can run an OpenBSD guest with
> > SEV-ES in vmm/vmd.  On kvm/qemu SEV still works, but I cannot get
> > a guest running with SEV-ES.
> >
> > So if you agree with the locore0 part that was in the original
> > mail of this thread, I would like to commit it.
> >
> > bluhm
> >
> 
> I think this diff needs a fair amount of work before it can go in.

The latest diff I sent contains everything that is still missing
for SEV-ES support.  It has some old XXX comments from months ago
that have since been resolved.  It was never intended for commit; I
provided it just in case you want to test the whole thing, as you
said in your previous mail that you wanted to test something.  The
huge diff was not intended for review.

> See below for comments. This is going to require a ton of testing. How do
> you want to proceed? For example, I think this needs to be tested as a
> guest on a number of sev/sev-es capable hypervisors and a wide variety
> of hardware. You already mentioned one breakage above, right?

The only problem I see is SEV-ES on a Linux KVM/qemu host.  But we
never had support for that before.  I would like to address this
after OpenBSD running on vmm/vmd has been committed.

I do have Linux KVM/qemu and OpenBSD vmm/vmd on AMD and Intel.  On
AMD I can run SEV with an OpenBSD guest, but not SEV-ES, because
the diff is not fully committed yet.  As SEV-ES is currently not
supported, this is not a regression.

Below is the diff that needs review.  It is the locore0 part of the
original mail in this thread.  The only testing that has to be done
is to confirm that it does not break anything.  SEV-ES will not
work yet, as the support is not complete.

Two of your remarks still apply with this next step.

> > @@ -193,6 +194,58 @@ bi_size_ok:
> >  	pushl	$PSL_MBO
> >  	popfl
> >
> > +	/*
> > +	 * Setup temporary #VC trap handler, in case we are running
> > +	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
> > +	 * init_x86_64().
> > +	 * We are setting up two handlers:
> > +	 *
> > +	 * 1) locore_vc_trap32:  Triggered when we are running in
> > +	 *    32-bit legacy mode.
> > +	 *
> > +	 * 2) locore_vc_trap64:  Triggered when we are running in
> > +	 *    32-bit compatibility mode.
> > +	 *
> > +	 * The latter one is used by vmd(8).
> 
> Please clarify; *when* is this used by vmd? I believe you mean when
> we do a direct kernel launch? If not, then why do we need the 32 bit
> one?

There are two ways we may enter the kernel.  KVM/qemu uses a special
TianoCore EFI implementation.  For vmm/vmd we currently support
direct kernel boot only.  hshoexer@ explained to me why we need
both 32 bit methods here, but I forgot.
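
The reason the diff needs two IDT slots for one vector is the gate
size: in 32 bit legacy mode the CPU indexes the IDT in 8-byte gates,
in long mode (including compatibility mode) in 16-byte gates.  Vector
T_VC (29) therefore sits at byte offset 29 * 8 for the 32 bit handler
and at 29 * 16 == (2 * 29) * 8 for the 64 bit one, which is why the
diff fills slot T_VC and slot 2 * T_VC of the same early_idt array
without the two entries overlapping.  For illustration, a minimal C
sketch of the two gate layouts (the structs are hypothetical, not
from the tree):

#include <stdint.h>

struct gate32 {			/* 8-byte legacy mode interrupt gate */
	uint16_t off_lo;	/* handler offset 15:0 */
	uint16_t sel;		/* code segment selector */
	uint8_t  zero;
	uint8_t  type_attr;	/* 0x80 | SDT_SYS386IGT */
	uint16_t off_hi;	/* handler offset 31:16 */
};

struct gate64 {			/* 16-byte long mode interrupt gate */
	uint16_t off_lo;	/* handler offset 15:0 */
	uint16_t sel;		/* GSEL(3, SEL_KPL) in the diff */
	uint8_t  ist;		/* no IST used here */
	uint8_t  type_attr;	/* 0x80 | SDT_SYS386IGT */
	uint16_t off_mid;	/* handler offset 31:16 */
	uint32_t off_hi;	/* offset 63:32, zeroed by locore0 */
	uint32_t reserved;
};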

> >  	/* XXX merge these */
> >  	call	init_x86_64
> >  	call	main
> >
> > +	/* MSR Protocol Request Codes */
> > +#define MSRPROTO_CPUID_REQ	0x4
> > +#define MSRPROTO_TERM_REQ	0x100
> > +
> > +vc_cpuid64:
> > +	shll	$30, %eax		/* requested register */
> > +	orl	$MSRPROTO_CPUID_REQ, %eax
> > +	movl	%ebx, %edx		/* CPUID function */
> > +	movl	$MSR_SEV_GHCB, %ecx
> > +	wrmsr
> > +	rep vmmcall
> 
> Out of curiosity, why is the rep prefix needed here?

I don't know.  hshoexer?
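
FWIW, the AMD APM encodes VMGEXIT as F3 0F 01 D9, i.e. VMMCALL with
a REP prefix, so "rep vmmcall" is how you spell VMGEXIT for
assemblers that lack the newer mnemonic.  In a SEV-ES guest VMGEXIT
exits straight to the hypervisor, whereas an intercepted plain
VMMCALL would raise another #VC inside the handler.  The same idiom
as a C inline, with a hypothetical helper name:

static inline void
vmgexit(void)			/* VMGEXIT == rep vmmcall */
{
	__asm__ volatile("rep; vmmcall" ::: "memory");
}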

Below is the next diff that I want to commit.  It is guest side
only.  My testing can only verify that nothing breaks.

ok?

bluhm

On Wed, May 21, 2025 at 05:19:14PM +0200, Hans-Jörg Höxer wrote:
> >     SEV-ES guest: locore #VC trap handling
> >
> >     When locore is executed by a SEV-ES enabled guest the first cpuid
> >     instruction will raise a #VC trap that will need to be handled.
> >     However, at that point in time the guest does not know whether it's
> >     a guest at all, if it is running on an AMD CPU with SEV-ES enabled,
> >     etc.
> >
> >     To resolve this chicken-egg situation we unconditionally set up a
> >     #VC trap handler.  If we are not a SEV-ES enabled guest, or we are
> >     running on some non-AMD CPU, we will not raise #VC (hopefully).  On
> >     Intel CPUs the vector for #VC is reserved.
> >
> >     As vmd(8) configures the runtime for locore to be in 32 bit
> >     compatibility mode, a raised #VC exception will switch to long
> >     mode, and the CPU will expect a 64 bit entry in the IDT.  When
> >     running on e.g. KVM, locore is executed in 32 bit legacy mode.
> >     There the CPU will expect a 32 bit entry in the IDT.
> >
> >     To accommodate both situations, we set up both 64 and 32 bit handlers
> >     in the IDT.
> >
> >     Additionally, vmd(8) has to set up a long mode segment in the GDT.
> >
> >     Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> >     to emulate CPUID.  The MSR protocol only supports "simple" CPUID
> >     without subfunctions.
> >
> >     Note:  When SEV-ES is enabled, the hypervisor cannot intercept
> >     writes to EFER beforehand, only after the write.  Thus on vmm(4)
> >     with directly executed kernel we are in compatibility mode and
> >     EFER_LMA is set.  As resetting EFER_LMA raises #GP we have to
> >     preserve it.
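
A note on what the two handlers in the diff do, for reviewers
without the GHCB spec at hand: on #VC the CPU pushes the SVM exit
code as the error code, which the handlers compare against
SVM_VMEXIT_CPUID.  They then fetch each of the four result registers
with one MSR protocol round trip, drop the error code, and advance
the saved instruction pointer by 2, the length of the cpuid
instruction (0F A2).  Roughly, one round trip looks like this in C
(rdmsr/wrmsr/vmgexit are assumed helpers, not existing kernel API,
and the response check is an addition the assembly does not make):

#include <stdint.h>

#define MSR_SEV_GHCB		0xc0010130
#define MSR_PROTO_CPUID_REQ	0x4
#define MSR_PROTO_CPUID_RESP	0x5

uint64_t rdmsr(uint32_t);		/* assumed helpers */
void	 wrmsr(uint32_t, uint64_t);
void	 vmgexit(void);			/* rep vmmcall */

static int
msr_proto_cpuid(uint32_t func, int reg /* 0=eax .. 3=edx */,
    uint32_t *val)
{
	uint64_t ghcb;

	/* CPUID function in bits 63:32, register index in bits 31:30 */
	ghcb = (uint64_t)func << 32 | (uint64_t)reg << 30 |
	    MSR_PROTO_CPUID_REQ;
	wrmsr(MSR_SEV_GHCB, ghcb);
	vmgexit();
	ghcb = rdmsr(MSR_SEV_GHCB);
	if ((ghcb & 0xfff) != MSR_PROTO_CPUID_RESP)
		return -1;		/* hypervisor refused the request */
	*val = ghcb >> 32;		/* result in bits 63:32 */
	return 0;
}

As the CPUID function number occupies the whole upper half of the
MSR, there is no room to pass a subfunction (%ecx) input, which is
why only "simple" CPUID works over this protocol.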

Index: arch/amd64/amd64/locore0.S
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
diff -u -p -r1.27 locore0.S
--- arch/amd64/amd64/locore0.S	5 May 2025 23:02:39 -0000	1.27
+++ arch/amd64/amd64/locore0.S	20 Jun 2025 16:31:14 -0000
@@ -111,6 +111,9 @@
 #include <machine/param.h>
 #include <machine/segments.h>
 #include <machine/specialreg.h>
+#include <machine/trap.h>
+#include <machine/ghcb.h>
+#include <machine/vmmvar.h>
 
 /*
  * override user-land alignment before including asm.h
@@ -193,6 +196,58 @@ bi_size_ok:
 	pushl	$PSL_MBO
 	popfl
 
+	/*
+	 * Setup temporary #VC trap handler, in case we are running
+	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
+	 * init_x86_64().
+	 * We are setting up two handlers:
+	 *
+	 * 1) locore_vc_trap32:  Triggered when we are running in
+	 *    32-bit legacy mode.
+	 *
+	 * 2) locore_vc_trap64:  Triggered when we are running in
+	 *    32-bit compatibility mode.
+	 *
+	 * The latter one is used by vmd(8).
+	 */
+	movl	$RELOC(early_idt), %ecx
+	movl	$T_VC, %edx
+	leal	(%ecx, %edx, 8), %ecx		/* 32bit #VC IDT slot */
+
+	pushl	%cs				/* get current %cs */
+	popl	%ebx
+	shll	$16, %ebx
+
+	movl	$RELOC(locore_vc_trap32), %eax
+	andl	$0x0000ffff, %eax
+	orl	%ebx, %eax			/* use current %cs */
+	movl	%eax, (%ecx)
+
+	movl	$RELOC(locore_vc_trap32), %eax
+	andl	$0xffff0000, %eax
+	orl	$((0x80 | SDT_SYS386IGT) << 8), %eax
+	movl	%eax, 4(%ecx)
+
+	movl	$RELOC(early_idt), %ecx
+	movl	$(2 * T_VC), %edx
+	leal	(%ecx, %edx, 8), %ecx		/* 64bit #VC IDT slot */
+
+	movl	$RELOC(locore_vc_trap64), %eax
+	andl	$0x0000ffff, %eax
+	orl	$(GSEL(3, SEL_KPL) << 16), %eax
+	movl	%eax, (%ecx)
+
+	movl	$RELOC(locore_vc_trap64), %eax
+	andl	$0xffff0000, %eax
+	orl	$((0x80 | SDT_SYS386IGT) << 8), %eax
+	movl	%eax, 4(%ecx)
+	xorl	%eax, %eax
+	movl	%eax, 8(%ecx)
+	movl	%eax, 12(%ecx)
+
+	movl	$RELOC(idtlc), %eax
+	lidt	(%eax)
+
 	/* Reset debug control registers */
 	xorl	%eax,%eax
 	movl	%eax,%dr6
@@ -631,8 +686,14 @@ store_pte:
 	 */
 	movl	$MSR_EFER,%ecx
 	rdmsr
+	movl	%eax,%ebx
 	xorl	%eax,%eax	/* XXX */
 	orl	$(EFER_LME|EFER_SCE),%eax
+	/* If set, preserve LMA */
+	testl	$EFER_LMA,%ebx
+	jz	efer_nxe
+	orl	$EFER_LMA,%eax
+efer_nxe:
 	movl	RELOC((pg_nx + 4)), %ebx
 	cmpl	$0, %ebx
 	je	write_efer
@@ -745,6 +806,118 @@ longmode_hi:
 	call	init_x86_64
 	call	main
 
+vc_cpuid64:
+	shll	$30, %eax		/* requested register */
+	orl	$MSR_PROTO_CPUID_REQ, %eax
+	movl	%ebx, %edx		/* CPUID function */
+	movl	$MSR_SEV_GHCB, %ecx
+	wrmsr
+	rep vmmcall
+	rdmsr
+	ret
+
+	.globl	locore_vc_trap64
+locore_vc_trap64:
+	pushq	%rax
+	pushq	%rbx
+	pushq	%rcx
+	pushq	%rdx
+
+	cmpl	$SVM_VMEXIT_CPUID, 32(%rsp)
+	jne	.Lterminate64
+
+	movl	%eax, %ebx		/* save CPUID function */
+
+	movl	$0, %eax		/* request cpuid, get %eax */
+	call	vc_cpuid64
+	movq	%rdx, 24(%rsp)
+
+	movl	$1, %eax		/* get %ebx */
+	call	vc_cpuid64
+	movq	%rdx, 16(%rsp)
+
+	movl	$2, %eax		/* get %ecx */
+	call	vc_cpuid64
+	movq	%rdx, 8(%rsp)
+
+	movl	$3, %eax		/* get %edx */
+	call	vc_cpuid64
+	movq	%rdx, 0(%rsp)
+
+	popq	%rdx
+	popq	%rcx
+	popq	%rbx
+	popq	%rax
+	addq	$8, %rsp
+	addq	$2, (%rsp)
+	iretq
+
+.Lterminate64:
+	movl	$MSR_PROTO_TERMINATE, %eax
+	movl	$MSR_SEV_GHCB, %ecx
+	wrmsr
+	rep vmmcall
+.Lterm_loop64:
+	hlt
+	jmp	.Lterm_loop64
+
+	.code32
+vc_cpuid32:
+	shll	$30, %eax		/* requested register */
+	orl	$MSR_PROTO_CPUID_REQ, %eax
+	movl	%ebx, %edx		/* CPUID function */
+	movl	$MSR_SEV_GHCB, %ecx
+	wrmsr
+	rep vmmcall
+	rdmsr
+	ret
+
+	.globl	locore_vc_trap32
+locore_vc_trap32:
+	pushl	%eax
+	pushl	%ebx
+	pushl	%ecx
+	pushl	%edx
+
+	cmpl	$SVM_VMEXIT_CPUID, 16(%esp)
+	jne	.Lterminate32
+
+	movl	%eax, %ebx		/* save CPUID function */
+
+	movl	$0, %eax		/* request cpuid, get %eax */
+	call	vc_cpuid32
+	movl	%edx, 12(%esp)
+
+	movl	$1, %eax		/* get %ebx */
+	call	vc_cpuid32
+	movl	%edx, 8(%esp)
+
+	movl	$2, %eax		/* get %ecx */
+	call	vc_cpuid32
+	movl	%edx, 4(%esp)
+
+	movl	$3, %eax		/* get %edx */
+	call	vc_cpuid32
+	movl	%edx, 0(%esp)
+
+	popl	%edx
+	popl	%ecx
+	popl	%ebx
+	popl	%eax
+	addl	$4, %esp
+	addl	$2, (%esp)
+	iret
+
+.Lterminate32:
+	movl	$MSR_PROTO_TERMINATE, %eax
+	movl	$MSR_SEV_GHCB, %ecx
+	wrmsr
+	rep vmmcall
+.Lterm_loop32:
+	hlt
+	jmp	.Lterm_loop32
+
+
 	.section .codepatch,"a"
 	.align	8, 0xcc
 	.globl codepatch_begin
@@ -757,6 +930,20 @@ codepatch_end:
 	.previous
 
 	.data
+	.globl	idtlc			/* temporary locore IDT */
+idtlc:
+	.word	early_idt_end-early_idt-1
+	.long	_RELOC(early_idt)
+	.align 64, 0xcc
+
+	.globl	early_idt
+early_idt:
+	.rept	NIDT
+	.quad	0x0000000000000000
+	.quad	0x0000000000000000
+	.endr
+early_idt_end:
+
 	.globl	gdt64
 gdt64:
 	.word	gdt64_end-gdt64_start-1
Index: arch/amd64/include/ghcb.h
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/ghcb.h,v
diff -u -p -r1.2 ghcb.h
--- arch/amd64/include/ghcb.h	28 May 2025 07:59:05 -0000	1.2
+++ arch/amd64/include/ghcb.h	20 Jun 2025 16:31:14 -0000
@@ -19,6 +19,8 @@
 #ifndef _MACHINE_GHCB_H_
 #define _MACHINE_GHCB_H_
 
+#ifndef _LOCORE
+
 #include <machine/frame.h>
 
 #define GHCB_OFFSET(m)			((m) / 8)
@@ -98,12 +100,15 @@ struct ghcb_sync {
 	int			sz_c;
 	int			sz_d;
 };
+#endif /* !_LOCORE */
 
 /* Definitions used with the MSR protocol */
 #define MSR_PROTO_CPUID_REQ	0x4
 #define MSR_PROTO_CPUID_RESP	0x5
 #define MSR_PROTO_TERMINATE	0x100
 
+#ifndef _LOCORE
+
 void	ghcb_clear(struct ghcb_sa *);
 int	ghcb_valbm_set(uint8_t *, int);
 int	ghcb_valbm_isset(uint8_t *, int);
@@ -114,5 +119,7 @@ void	ghcb_sync_val(int, int, struct ghcb
 void	ghcb_sync_out(struct trapframe *, uint64_t, uint64_t, uint64_t,
 	    struct ghcb_sa *, struct ghcb_sync *);
 void	ghcb_sync_in(struct trapframe *, struct ghcb_sa *, struct ghcb_sync *);
+
+#endif /* !_LOCORE */
 
 #endif /* !_MACHINE_GHCB_H_ */
Index: arch/amd64/include/trap.h
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/trap.h,v
diff -u -p -r1.5 trap.h
--- arch/amd64/include/trap.h	15 Apr 2023 01:22:50 -0000	1.5
+++ arch/amd64/include/trap.h	20 Jun 2025 16:31:14 -0000
@@ -62,3 +62,4 @@
 #define	T_XMM		19	/* SSE FP exception */
 #define	T_VE		20	/* virtualization exception */
 #define	T_CP		21	/* control protection exception */
+#define	T_VC		29	/* VMM communication exception */
Index: arch/amd64/include/vmmvar.h
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/vmmvar.h,v
diff -u -p -r1.114 vmmvar.h
--- arch/amd64/include/vmmvar.h	24 May 2025 12:47:00 -0000	1.114
+++ arch/amd64/include/vmmvar.h	20 Jun 2025 16:31:14 -0000
@@ -21,6 +21,8 @@
 #ifndef _MACHINE_VMMVAR_H_
 #define _MACHINE_VMMVAR_H_
 
+#ifndef _LOCORE
+
 #define VMM_HV_SIGNATURE 	"OpenBSDVMM58"
 
 /* VMX: Basic Exit Reasons */
@@ -94,6 +96,8 @@
 #define VMX_MAX_CR3_TARGETS			256
 #define VMX_VMCS_PA_CLEAR			0xFFFFFFFFFFFFFFFFUL
 
+#endif	/* ! _LOCORE */
+
 /*
  * SVM: Intercept codes (exit reasons)
  */
@@ -262,6 +266,8 @@
 #define SVM_VMEXIT_VMGEXIT			0x403
 #define SVM_VMEXIT_INVALID			-1
 
+#ifndef _LOCORE
+
 /*
  * Exception injection vectors (these correspond to the CPU exception types
  * defined in the SDM.)
@@ -1056,5 +1062,7 @@ int	vcpu_reset_regs(struct vcpu *, struc
 int	svm_get_vmsa_pa(uint32_t, uint32_t, uint64_t *);
 
 #endif /* _KERNEL */
+
+#endif	/* ! _LOCORE */
 
 #endif /* ! _MACHINE_VMMVAR_H_ */