
From:
Alexander Bluhm <bluhm@openbsd.org>
Subject:
Re: SEV-ES guest: locore #VC trap handling
To:
Mike Larkin <mlarkin@nested.page>
Cc:
tech@openbsd.org
Date:
Mon, 23 Jun 2025 01:36:20 +0200

On Sat, Jun 21, 2025 at 10:43:23PM -0700, Mike Larkin wrote:
> On Fri, Jun 20, 2025 at 09:06:37PM +0200, Alexander Bluhm wrote:
> > On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> > > On Sun, Jun 15, 2025 at 04:55:12PM +0200, Alexander Bluhm wrote:
> > > > On Sat, Jun 14, 2025 at 10:23:53PM -0700, Mike Larkin wrote:
> > > > > On Wed, May 21, 2025 at 05:10:27PM +0200, Hans-Jörg Höxer wrote:
> > > > > > Hi,
> > > > > >
> > > > > > this change deals with locore for SEV-ES enabled guests.  The approach
> > > > > > might be a bit controversial.  And it requires a diff for vmd(8),
> > > > > > which I've also attached, to simplify the discussion:
> > > > > >
> > > > > >     SEV-ES guest: locore #VC trap handling
> > > > > >
> > > > > >     When locore is executed by a SEV-ES enabled guest, the first cpuid
> > > > > >     instruction will raise a #VC trap that will need to be handled.
> > > > > >     However, at that point in time the guest does not know whether it's
> > > > > >     a guest at all, if it is running on an AMD CPU with SEV-ES enabled,
> > > > > >     etc.
> > > > > >
> > > > > >     To resolve this chicken-and-egg situation we unconditionally set up
> > > > > >     a #VC trap handler.  If we are not a SEV-ES enabled guest, or we are
> > > > > >     running on some non-AMD CPU, we will not raise #VC (hopefully).  On
> > > > > >     Intel CPUs the vector for #VC is reserved.
> > > > > >
> > > > > >     As vmd(8) configures the runtime for locore to be 32-bit
> > > > > >     compatibility mode, a raised #VC exception will switch to long
> > > > > >     mode, and the CPU will expect a 64-bit entry in the IDT.  When
> > > > > >     running on e.g. KVM, locore is executed in 32-bit legacy mode.
> > > > > >     There the CPU will expect a 32-bit entry in the IDT.
> > > > > >
> > > > > >     To accommodate both situations, we set up both 64 and 32 bit
> > > > > >     handlers in the IDT.
> > > > > >
> > > > > >     Additionally, vmd(8) has to set up a long-mode segment in the GDT.
> > > > > >
> > > > > >     Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> > > > > >     to emulate CPUID.  The MSR protocol only supports "simple" CPUID
> > > > > >     without subfunctions.
> > > > > >
> > > > > >     Note:  When SEV-ES is enabled, the hypervisor cannot intercept
> > > > > >     writes to EFER beforehand, only after the write.  Thus on vmm(4)
> > > > > >     with a directly executed kernel we are in compatibility mode and
> > > > > >     EFER_LMA is set.  As resetting EFER_LMA raises #GP, we have to
> > > > > >     preserve it.
> > > > > >
> > > > > > Take care,
> > > > > > HJ.
> > > > > >
> > > > >
> > > > > This one probably needs a lot of testing. What tests have you guys done so
> > > > > far? My SEV-ES machine is available now, I will test this and let you know
> > > > > but we probably want a lot of testing from different host/guest combos.
> > > >
> > > > I know that this part does not break SEV (without ES) on vmm/vmd
> > > > and kvm/qemu.
> > > >
> > > > Below is the complete diff that hshoexer@ sent some months ago,
> > > > rebased to current.  With that I can run an OpenBSD guest with SEV-ES
> > > > in vmm/vmd.  On kvm/qemu SEV still works, but I cannot get a guest
> > > > running with SEV-ES.
> > > >
> > > > So if you agree with the locore0 part, that was in the original
> > > > mail of this thread, I would like to commit it.
> > > >
> > > > bluhm
> > > >
> > >
> > > I think this diff needs a fair amount of work before it can go in.
> >
> > The latest diff I sent is everything that is missing for SEV-ES
> > support.  It has some old XXX comments from months ago that have been
> > resolved.  It was never intended for commit; I provided it just in
> > case you want to test the whole thing.  In your previous mail you
> > said that you want to test something.  The huge diff was not intended
> > for review.
> >
> > > See below for comments. This is going to require a ton of testing. How do
> > > you want to proceed? For example, I think this needs to be tested as a
> > > guest on a number of sev/sev-es capable hypervisors and a wide variety
> > > of hardware. You already mentioned one breakage above, right?
> >
> > The only problem I see is SEV-ES on a Linux KVM/qemu host.  But we
> > never had support for that before.  I would like to address this
> > after OpenBSD running on vmm/vmd has been committed.
> >
> > I do have Linux KVM/qemu and OpenBSD vmm/vmd on AMD and Intel.  On
> > AMD I can run SEV with an OpenBSD guest, but not SEV-ES because the
> > diff is not fully committed yet.  As SEV-ES is currently not
> > supported, this is not a regression.
> >
> > Below is the diff that needs review.  It is the locore0 part of the
> > original mail in this thread.  The only testing that has to be done
> > is that it does not break anything.  SEV-ES will not work, as it
> > is not complete.
> >
> > Two of your remarks still apply with this next step.
> >
> > > > @@ -193,6 +194,58 @@ bi_size_ok:
> > > >  	pushl	$PSL_MBO
> > > >  	popfl
> > > >
> > > > +	/*
> > > > +	 * Setup temporary #VC trap handler, in case we are running
> > > > +	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
> > > > +	 * init_x86_64().
> > > > +	 * We are setting up two handlers:
> > > > +	 *
> > > > +	 * 1) locore_vc_trap32:  Triggered when we are running in
> > > > +	 *    32-bit legacy mode.
> > > > +	 *
> > > > +	 * 2) locore_vc_trap64:  Triggered when we are running in
> > > > +	 *    32-bit compatibility mode.
> > > > +	 *
> > > > +	 * The latter one is used by vmd(8).
> > >
> > > Please clarify; *when* is this used by vmd? I believe you mean when
> > > we do a direct kernel launch? If not, then why do we need the 32 bit
> > > one?
> >
> > There are two ways we may enter the kernel.  KVM/qemu uses a special
> > TianoCore EFI implementation.  For vmm/vmd we currently support
> > direct kernel boot only.  hshoexer@ explained to me why we need
> > both 32-bit methods here, but I forgot.
> >
> > > >  	/* XXX merge these */
> > > >  	call	init_x86_64
> > > >  	call	main
> > > >
> > > > +	/* MSR Protocol Request Codes */
> > > > +#define MSRPROTO_CPUID_REQ	0x4
> > > > +#define MSRPROTO_TERM_REQ	0x100
> > > > +
> > > > +vc_cpuid64:
> > > > +	shll	$30, %eax		/* requested register */
> > > > +	orl	$MSRPROTO_CPUID_REQ, %eax
> > > > +	movl	%ebx, %edx		/* CPUID function */
> > > > +	movl	$MSR_SEV_GHCB, %ecx
> > > > +	wrmsr
> > > > +	rep vmmcall
> > >
> > > Out of curiosity, why is the rep prefix needed here?
> >
> > I don't know.  hshoexer?
> >
> > Below is the next diff that I want to commit.  It is guest-side
> > only.  My testing can only cover that nothing breaks.
> >
> > ok?
> >
> > bluhm
> >
> 
> I think this locore diff below looks ok to me but it changes some very
> fundamental early boot behavior. I was asking before what your plan is for
> testing this before commit. I don't think testing this on just a handful
> of machines is sufficient. It's particularly tough with this one as if it
> breaks someone, the machine will pretty much be bricked and require fixing
> via install media. That's why I'm hesitant.

As we do not touch disk or boot loader, boot> obsd should be enough
to recover.  I have tested it on laptops and servers, Intel and
AMD, virtual and physical machines, with and without SEV.  No fallout
so far.

I will commit tomorrow when hshoexer is in the office.  Thanks for
all your effort looking into this.

> If you are ready to handle any fallout, then sure. It's probably the right
> time in the release process for this. ok mlarkin on the diff itself in that
> case. There are a few things we can change later (like using .Lxxx in all
> symbols in locore0 and there are a few style(9) things) but nothing that
> warrants fixing now.
> 
> P.S. hshoexer: The locore0 page gets unmapped right after we are done with it;
> can you please verify that the #VC traps and locore0 IDTs aren't needed after
> that unmapping?

If I remember correctly, this handler is only temporarily used
during boot.  It will be replaced with a permanent handler as soon
as the C code initializes.  This will be in a follow-up diff.
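
As an editorial aside for readers following the vc_cpuid32/vc_cpuid64
helpers in the diff: the GHCB MSR protocol packs the CPUID function into
the upper 32 bits of the MSR value (the %edx half of the wrmsr pair), the
requested register index (0 = eax .. 3 = edx) into bits 31:30, and the
request code into the low bits.  A rough userland C sketch of just that
encoding; the helper name is made up for illustration, only
MSR_PROTO_CPUID_REQ comes from the ghcb.h hunk:

```c
#include <stdint.h>

#define MSR_PROTO_CPUID_REQ	0x4	/* request code, from the ghcb.h hunk */

/*
 * Build the 64-bit value written to MSR_SEV_GHCB for a CPUID request,
 * mirroring "shll $30, %eax; orl $MSR_PROTO_CPUID_REQ, %eax" with the
 * CPUID function number placed in the high 32 bits.
 */
static uint64_t
msrproto_cpuid_req(uint32_t func, uint32_t reg)
{
	return ((uint64_t)func << 32) | ((uint64_t)reg << 30) |
	    MSR_PROTO_CPUID_REQ;
}
```

The hypervisor's response arrives the same way, read back with rdmsr
after the VMGEXIT, one register per round trip, which is why the trap
handlers call the helper four times.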

bluhm

> 
> -ml
> 
> > On Wed, May 21, 2025 at 05:19:14PM +0200, Hans-Jörg Höxer wrote:
> > > >     SEV-ES guest: locore #VC trap handling
> > > >
> > > >     When locore is executed by a SEV-ES enabled guest, the first cpuid
> > > >     instruction will raise a #VC trap that will need to be handled.
> > > >     However, at that point in time the guest does not know whether it's
> > > >     a guest at all, if it is running on an AMD CPU with SEV-ES enabled,
> > > >     etc.
> > > >
> > > >     To resolve this chicken-and-egg situation we unconditionally set up
> > > >     a #VC trap handler.  If we are not a SEV-ES enabled guest, or we are
> > > >     running on some non-AMD CPU, we will not raise #VC (hopefully).  On
> > > >     Intel CPUs the vector for #VC is reserved.
> > > >
> > > >     As vmd(8) configures the runtime for locore to be 32-bit
> > > >     compatibility mode, a raised #VC exception will switch to long
> > > >     mode, and the CPU will expect a 64-bit entry in the IDT.  When
> > > >     running on e.g. KVM, locore is executed in 32-bit legacy mode.
> > > >     There the CPU will expect a 32-bit entry in the IDT.
> > > >
> > > >     To accommodate both situations, we set up both 64 and 32 bit
> > > >     handlers in the IDT.
> > > >
> > > >     Additionally, vmd(8) has to set up a long-mode segment in the GDT.
> > > >
> > > >     Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> > > >     to emulate CPUID.  The MSR protocol only supports "simple" CPUID
> > > >     without subfunctions.
> > > >
> > > >     Note:  When SEV-ES is enabled, the hypervisor cannot intercept
> > > >     writes to EFER beforehand, only after the write.  Thus on vmm(4)
> > > >     with a directly executed kernel we are in compatibility mode and
> > > >     EFER_LMA is set.  As resetting EFER_LMA raises #GP, we have to
> > > >     preserve it.
> >
> > Index: arch/amd64/amd64/locore0.S
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
> > diff -u -p -r1.27 locore0.S
> > --- arch/amd64/amd64/locore0.S	5 May 2025 23:02:39 -0000	1.27
> > +++ arch/amd64/amd64/locore0.S	20 Jun 2025 16:31:14 -0000
> > @@ -111,6 +111,9 @@
> >  #include <machine/param.h>
> >  #include <machine/segments.h>
> >  #include <machine/specialreg.h>
> > +#include <machine/trap.h>
> > +#include <machine/ghcb.h>
> > +#include <machine/vmmvar.h>
> >
> >  /*
> >   * override user-land alignment before including asm.h
> > @@ -193,6 +196,58 @@ bi_size_ok:
> >  	pushl	$PSL_MBO
> >  	popfl
> >
> > +	/*
> > +	 * Setup temporary #VC trap handler, in case we are running
> > +	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
> > +	 * init_x86_64().
> > +	 * We are setting up two handlers:
> > +	 *
> > +	 * 1) locore_vc_trap32:  Triggered when we are running in
> > +	 *    32-bit legacy mode.
> > +	 *
> > +	 * 2) locore_vc_trap64:  Triggered when we are running in
> > +	 *    32-bit compatibility mode.
> > +	 *
> > +	 * The latter one is used by vmd(8).
> > +	 */
> > +	movl	$RELOC(early_idt), %ecx
> > +	movl	$T_VC, %edx
> > +	leal	(%ecx, %edx, 8), %ecx		/* 32bit #VC IDT slot */
> > +
> > +	pushl	%cs				/* get current %cs */
> > +	popl	%ebx
> > +	shll	$16, %ebx
> > +
> > +	movl	$RELOC(locore_vc_trap32), %eax
> > +	andl	$0x0000ffff, %eax
> > +	orl	%ebx, %eax			/* use current %cs */
> > +	movl	%eax, (%ecx)
> > +
> > +	movl	$RELOC(locore_vc_trap32), %eax
> > +	andl	$0xffff0000, %eax
> > +	orl	$((0x80 | SDT_SYS386IGT) << 8), %eax
> > +	movl	%eax, 4(%ecx)
> > +
> > +	movl	$RELOC(early_idt), %ecx
> > +	movl	$(2 * T_VC), %edx
> > +	leal	(%ecx, %edx, 8), %ecx		/* 64bit #VC IDT slot */
> > +
> > +	movl	$RELOC(locore_vc_trap64), %eax
> > +	andl	$0x0000ffff, %eax
> > +	orl	$(GSEL(3, SEL_KPL) << 16), %eax
> > +	movl	%eax, (%ecx)
> > +
> > +	movl	$RELOC(locore_vc_trap64), %eax
> > +	andl	$0xffff0000, %eax
> > +	orl	$((0x80 | SDT_SYS386IGT) << 8), %eax
> > +	movl	%eax, 4(%ecx)
> > +	xorl	%eax, %eax
> > +	movl	%eax, 8(%ecx)
> > +	movl	%eax, 12(%ecx)
> > +
> > +	movl	$RELOC(idtlc), %eax
> > +	lidt	(%eax)
> > +
> >  	/* Reset debug control registers */
> >  	xorl	%eax,%eax
> >  	movl	%eax,%dr6
> > @@ -631,8 +686,14 @@ store_pte:
> >  	 */
> >  	movl	$MSR_EFER,%ecx
> >  	rdmsr
> > +	movl	%eax,%ebx
> >  	xorl	%eax,%eax	/* XXX */
> >  	orl	$(EFER_LME|EFER_SCE),%eax
> > +	/* If set, preserve LMA */
> > +	testl	$EFER_LMA,%ebx
> > +	jz	efer_nxe
> > +	orl	$EFER_LMA,%eax
> > +efer_nxe:
> >  	movl	RELOC((pg_nx + 4)), %ebx
> >  	cmpl	$0, %ebx
> >  	je	write_efer
> > @@ -745,6 +806,118 @@ longmode_hi:
> >  	call	init_x86_64
> >  	call	main
> >
> > +vc_cpuid64:
> > +	shll	$30, %eax		/* requested register */
> > +	orl	$MSR_PROTO_CPUID_REQ, %eax
> > +	movl	%ebx, %edx		/* CPUID function */
> > +	movl	$MSR_SEV_GHCB, %ecx
> > +	wrmsr
> > +	rep vmmcall
> > +	rdmsr
> > +	ret
> > +
> > +	.globl	locore_vc_trap64
> > +locore_vc_trap64:
> > +	pushq	%rax
> > +	pushq	%rbx
> > +	pushq	%rcx
> > +	pushq	%rdx
> > +
> > +	cmpl	$SVM_VMEXIT_CPUID, 32(%rsp)
> > +	jne	.Lterminate64
> > +
> > +	movl	%eax, %ebx		/* save CPUID function */
> > +
> > +	movl	$0, %eax		/* request cpuid, get %eax */
> > +	call	vc_cpuid64
> > +	movq	%rdx, 24(%rsp)
> > +
> > +	movl	$1, %eax		/* get %ebx */
> > +	call	vc_cpuid64
> > +	movq	%rdx, 16(%rsp)
> > +
> > +	movl	$2, %eax		/* get %ecx */
> > +	call	vc_cpuid64
> > +	movq	%rdx, 8(%rsp)
> > +
> > +	movl	$3, %eax		/* get %edx */
> > +	call	vc_cpuid64
> > +	movq	%rdx, 0(%rsp)
> > +
> > +	popq	%rdx
> > +	popq	%rcx
> > +	popq	%rbx
> > +	popq	%rax
> > +	addq	$8, %rsp
> > +	addq	$2, (%rsp)
> > +	iretq
> > +
> > +.Lterminate64:
> > +	movl	$MSR_PROTO_TERMINATE, %eax
> > +	movl	$MSR_SEV_GHCB, %ecx
> > +	wrmsr
> > +	rep vmmcall
> > +.Lterm_loop64:
> > +	hlt
> > +	jmp	.Lterm_loop64
> > +
> > +	.code32
> > +vc_cpuid32:
> > +	shll	$30, %eax		/* requested register */
> > +	orl	$MSR_PROTO_CPUID_REQ, %eax
> > +	movl	%ebx, %edx		/* CPUID function */
> > +	movl	$MSR_SEV_GHCB, %ecx
> > +	wrmsr
> > +	rep vmmcall
> > +	rdmsr
> > +	ret
> > +
> > +	.globl	locore_vc_trap32
> > +locore_vc_trap32:
> > +	pushl	%eax
> > +	pushl	%ebx
> > +	pushl	%ecx
> > +	pushl	%edx
> > +
> > +	cmpl	$SVM_VMEXIT_CPUID, 16(%esp)
> > +	jne	.Lterminate32
> > +
> > +	movl	%eax, %ebx		/* save CPUID function */
> > +
> > +	movl	$0, %eax		/* request cpuid, get %eax */
> > +	call	vc_cpuid32
> > +	movl	%edx, 12(%esp)
> > +
> > +	movl	$1, %eax		/* get %ebx */
> > +	call	vc_cpuid32
> > +	movl	%edx, 8(%esp)
> > +
> > +	movl	$2, %eax		/* get %ecx */
> > +	call	vc_cpuid32
> > +	movl	%edx, 4(%esp)
> > +
> > +	movl	$3, %eax		/* get %edx */
> > +	call	vc_cpuid32
> > +	movl	%edx, 0(%esp)
> > +
> > +	popl	%edx
> > +	popl	%ecx
> > +	popl	%ebx
> > +	popl	%eax
> > +	addl	$4, %esp
> > +	addl	$2, (%esp)
> > +	iret
> > +
> > +.Lterminate32:
> > +	movl	$MSR_PROTO_TERMINATE, %eax
> > +	movl	$MSR_SEV_GHCB, %ecx
> > +	wrmsr
> > +	rep vmmcall
> > +.Lterm_loop32:
> > +	hlt
> > +	jmp	.Lterm_loop32
> > +
> > +
> >  	.section .codepatch,"a"
> >  	.align	8, 0xcc
> >  	.globl codepatch_begin
> > @@ -757,6 +930,20 @@ codepatch_end:
> >  	.previous
> >
> >  	.data
> > +	.globl	idtlc			/* temporary locore IDT */
> > +idtlc:
> > +	.word	early_idt_end-early_idt-1
> > +	.long	_RELOC(early_idt)
> > +	.align 64, 0xcc
> > +
> > +	.globl	early_idt
> > +early_idt:
> > +	.rept	NIDT
> > +	.quad	0x0000000000000000
> > +	.quad	0x0000000000000000
> > +	.endr
> > +early_idt_end:
> > +
> >  	.globl	gdt64
> >  gdt64:
> >  	.word	gdt64_end-gdt64_start-1
> > Index: arch/amd64/include/ghcb.h
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/ghcb.h,v
> > diff -u -p -r1.2 ghcb.h
> > --- arch/amd64/include/ghcb.h	28 May 2025 07:59:05 -0000	1.2
> > +++ arch/amd64/include/ghcb.h	20 Jun 2025 16:31:14 -0000
> > @@ -19,6 +19,8 @@
> >  #ifndef _MACHINE_GHCB_H_
> >  #define _MACHINE_GHCB_H_
> >
> > +#ifndef _LOCORE
> > +
> >  #include <machine/frame.h>
> >
> >  #define GHCB_OFFSET(m)			((m) / 8)
> > @@ -98,12 +100,15 @@ struct ghcb_sync {
> >  	int			sz_c;
> >  	int			sz_d;
> >  };
> > +#endif /* !_LOCORE */
> >
> >  /* Definitions used with the MSR protocol */
> >  #define MSR_PROTO_CPUID_REQ	0x4
> >  #define MSR_PROTO_CPUID_RESP	0x5
> >  #define MSR_PROTO_TERMINATE	0x100
> >
> > +#ifndef _LOCORE
> > +
> >  void	ghcb_clear(struct ghcb_sa *);
> >  int	ghcb_valbm_set(uint8_t *, int);
> >  int	ghcb_valbm_isset(uint8_t *, int);
> > @@ -114,5 +119,7 @@ void	ghcb_sync_val(int, int, struct ghcb
> >  void	ghcb_sync_out(struct trapframe *, uint64_t, uint64_t, uint64_t,
> >  	    struct ghcb_sa *, struct ghcb_sync *);
> >  void	ghcb_sync_in(struct trapframe *, struct ghcb_sa *, struct ghcb_sync *);
> > +
> > +#endif /* !_LOCORE */
> >
> >  #endif /* !_MACHINE_GHCB_H_ */
> > Index: arch/amd64/include/trap.h
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/trap.h,v
> > diff -u -p -r1.5 trap.h
> > --- arch/amd64/include/trap.h	15 Apr 2023 01:22:50 -0000	1.5
> > +++ arch/amd64/include/trap.h	20 Jun 2025 16:31:14 -0000
> > @@ -62,3 +62,4 @@
> >  #define	T_XMM		19	/* SSE FP exception */
> >  #define	T_VE		20	/* virtualization exception */
> >  #define	T_CP		21	/* control protection exception */
> > +#define	T_VC		29	/* VMM communication exception */
> > Index: arch/amd64/include/vmmvar.h
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/vmmvar.h,v
> > diff -u -p -r1.114 vmmvar.h
> > --- arch/amd64/include/vmmvar.h	24 May 2025 12:47:00 -0000	1.114
> > +++ arch/amd64/include/vmmvar.h	20 Jun 2025 16:31:14 -0000
> > @@ -21,6 +21,8 @@
> >  #ifndef _MACHINE_VMMVAR_H_
> >  #define _MACHINE_VMMVAR_H_
> >
> > +#ifndef _LOCORE
> > +
> >  #define VMM_HV_SIGNATURE 	"OpenBSDVMM58"
> >
> >  /* VMX: Basic Exit Reasons */
> > @@ -94,6 +96,8 @@
> >  #define VMX_MAX_CR3_TARGETS			256
> >  #define VMX_VMCS_PA_CLEAR			0xFFFFFFFFFFFFFFFFUL
> >
> > +#endif	/* ! _LOCORE */
> > +
> >  /*
> >   * SVM: Intercept codes (exit reasons)
> >   */
> > @@ -262,6 +266,8 @@
> >  #define SVM_VMEXIT_VMGEXIT			0x403
> >  #define SVM_VMEXIT_INVALID			-1
> >
> > +#ifndef _LOCORE
> > +
> >  /*
> >   * Exception injection vectors (these correspond to the CPU exception types
> >   * defined in the SDM.)
> > @@ -1056,5 +1062,7 @@ int	vcpu_reset_regs(struct vcpu *, struc
> >  int	svm_get_vmsa_pa(uint32_t, uint32_t, uint64_t *);
> >
> >  #endif /* _KERNEL */
> > +
> > +#endif	/* ! _LOCORE */
> >
> >  #endif /* ! _MACHINE_VMMVAR_H_ */
> >
>
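
Editorial note for readers decoding the hand-built 32-bit IDT slot in the
locore0.S hunk above: the two dwords written per slot are the classic i386
interrupt-gate layout (selector and low offset word in the first dword,
high offset word plus present bit and gate type in the second).  A
hypothetical C sketch of that packing; the struct and helper are
illustrative, not kernel code, and assume SDT_SYS386IGT is 14 as in
OpenBSD's segments.h:

```c
#include <stdint.h>

#define SDT_SYS386IGT	14	/* 386 interrupt gate type */

struct gate32 { uint32_t lo, hi; };	/* illustrative only */

/*
 * Pack a 32-bit interrupt gate the same way locore0.S does it by hand:
 * lo = selector<<16 | offset[15:0]
 * hi = offset[31:16] in the top half, plus P bit (0x80) and type at bits 15:8.
 */
static struct gate32
make_intr_gate32(uint32_t handler, uint16_t sel)
{
	struct gate32 g;

	g.lo = ((uint32_t)sel << 16) | (handler & 0xffff);
	g.hi = (handler & 0xffff0000) | ((0x80 | SDT_SYS386IGT) << 8);
	return g;
}
```

The 64-bit slot in the diff is the 16-byte long-mode variant of the same
idea: the first two dwords are packed identically (with GSEL(3, SEL_KPL) as
the selector), and the upper offset dword and reserved dword are zeroed,
which works because the handler address is still a low 32-bit address at
that point in boot.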