From: Mike Larkin
Subject: Re: [EXT] Re: SEV-ES guest: locore #VC trap handling
To: tech@openbsd.org
Date: Thu, 26 Jun 2025 02:02:51 -0700

On Tue, Jun 24, 2025 at 02:12:22PM +0200, Hans-Jörg Höxer wrote:
> Hi Mike,
>
> thanks for going through the diff!  The diff at hand is the large diff
> I sent some time ago.  It includes more than the actual locore0 #VC
> diff (that is committed now).
>
> You raise some good questions and the diff at hand provides more
> context.  So let me try to answer your questions inline, see below.
>
> Take care,
> Hans-Joerg

Thanks for clarifying. A couple small points inline...

-ml

> On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> > ...
> > > Index: sys/arch/amd64/amd64/locore0.S
> > > ===================================================================
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
> > > diff -u -p -r1.27 locore0.S
> > > --- sys/arch/amd64/amd64/locore0.S	5 May 2025 23:02:39 -0000	1.27
> > > +++ sys/arch/amd64/amd64/locore0.S	15 Jun 2025 11:03:01 -0000
> > > @@ -111,6 +111,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >
> > >  /*
> > >   * override user-land alignment before including asm.h
> > > @@ -193,6 +194,58 @@ bi_size_ok:
> > >  	pushl	$PSL_MBO
> > >  	popfl
> > >
> > > +	/*
> > > +	 * Setup temporary #VC trap handler, in case we are running
> > > +	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
> > > +	 * init_x86_64().
> > > +	 * We are setting up two handlers:
> > > +	 *
> > > +	 * 1) locore_vc_trap32: Triggered when we are running in
> > > +	 *    32-bit legacy mode.
> > > +	 *
> > > +	 * 2) locore_vc_trap64: Triggered when we are running in
> > > +	 *    32-bit compatibility mode.
> > > +	 *
> > > +	 * The latter one is used by vmd(8).
> >
> > Please clarify; *when* is this used by vmd? I believe you mean when
> > we do a direct kernel launch? If not, then why do we need the 32 bit
> > one?
>
> Yes.  To make initial work on SEV-* easier I started with just using
> direct kernel launch.  So no firmware and no bootloader is involved.
> In that case, vmd(8) sets up 32-bit compatibility mode.  This means
> exception handlers are required to run in a long mode segment.  Thus
> in that situation the handler needs to be 64-bit code.
>
> On Linux/KVM with the tianocore EFI firmware we actually run
> BOOTX64.EFI.  As far as I can see, this will run locore0 in 32-bit
> legacy mode.  So in that situation the trap handler runs in 32-bit
> legacy mode.  Thus we need a 32-bit handler, too.
>
> I will update the comment.
>
> Why is vmd(8) configuring compatibility mode for direct kernel launch?
> Legacy mode should also work, no?
>
> In the long term we need to be able to use legacy or EFI boot to run
> SEV-* guests anyway.  So this code might be simplified in the future.
>
> > ...
> > > @@ -384,7 +436,9 @@ cont:
> > >  #define PROC0_DMP2_OFF	(PROC0_DMP3_OFF + NDML3_ENTRIES * NBPG)
> > >  #define TABLESIZE \
> > >      ((NKL4_KIMG_ENTRIES + TABLE_L3_ENTRIES + TABLE_L2_ENTRIES + 1 + UPAGES + \
> > > -	NDML3_ENTRIES + NDML2_ENTRIES + 3) * NBPG)
> > > +	NDML3_ENTRIES + NDML2_ENTRIES + 2 + 3) * NBPG)
> >
> > Can this just be '5' ?
>
> yes.
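
For reference, with the constants folded the define would then read
something like:

        #define TABLESIZE \
            ((NKL4_KIMG_ENTRIES + TABLE_L3_ENTRIES + TABLE_L2_ENTRIES + 1 + UPAGES + \
            NDML3_ENTRIES + NDML2_ENTRIES + 5) * NBPG)

i.e. the three "early pte" pages and the two GHCB pages reserved in one
constant.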

> > ...
> > > @@ -514,6 +579,16 @@ map_tables:
> > >  	shrl	$PGSHIFT,%ecx
> > >  	fillkpt_nx
> > >
> > > +	/* Re-Map GHCB shared (ie. unencrypted) */
> > > +	/* XXX hshoexer: Only in SEV-ES guestmode. */
> >
> > Can we fix the XXXs in this diff before committing please?
>
> yes.
>
> > ...
> > > @@ -739,12 +819,131 @@ longmode_hi:
> > >  	movw	%ax,%fs
> > >
> > >  	leaq	TABLESIZE(%rsi),%rdi
> > > +	subq	$(NBPG*2), %rdi
> > >  	subq	$(NBPG*3), %rdi
> >
> > Can these be combined to NBPG * 5 ?
>
> yes.
>
> > >
> > >  	/* XXX merge these */
> > >  	call	init_x86_64
> > >  	call	main
> > >
> > > +	/* MSR Protocol Request Codes */
> > > +#define MSRPROTO_CPUID_REQ	0x4
> > > +#define MSRPROTO_TERM_REQ	0x100
> > > +
> > > +vc_cpuid64:
> > > +	shll	$30, %eax		/* requested register */
> > > +	orl	$MSRPROTO_CPUID_REQ, %eax
> > > +	movl	%ebx, %edx		/* CPUID function */
> > > +	movl	$MSR_SEV_GHCB, %ecx
> > > +	wrmsr
> > > +	rep vmmcall
> >
> > Out of curiosity, why is the rep prefix needed here?
>
> In the APM vol3 vmmcall is specified as "0x0f 0x01 0xd9" and vmgexit
> as "0xf2/0xf3 0x0f 0x01 0xd9".  As we do not have the vmgexit mnemonic
> in llvm yet, using "rep vmmcall" will produce the right byte sequence.
> Another option would be to use ".byte".  And of course adding the
> vmgexit mnemonic to llvm.
>
> I think I saw the use of "rep vmmcall" in Linux kernel code.
>
> The difference between vmmcall and vmgexit seems to be that if the
> hypervisor does not intercept vmmcall, the CPU will raise #UD and the
> guest will not exit.  However, vmgexit will always exit the guest.
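
Makes sense, thanks. For the archive, here is a rough C rendering of
what vc_cpuid64 above does; this is only a sketch, with the field
layout taken from the asm (request code in the low bits, register
index 0..3 in bits 31:30, CPUID function in bits 63:32) and
illustrative helper names:

        #include <stdint.h>

        #define MSR_SEV_GHCB            0xc0010130      /* GHCB MSR */
        #define MSRPROTO_CPUID_REQ      0x4

        static inline void
        wrmsr(uint32_t msr, uint64_t v)
        {
                __asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)v),
                    "d"((uint32_t)(v >> 32)));
        }

        static inline void
        vmgexit(void)
        {
                /* "rep; vmmcall" assembles to f3 0f 01 d9, i.e. vmgexit */
                __asm volatile("rep; vmmcall" : : : "memory");
        }

        /* Request one register (0=eax, 1=ebx, 2=ecx, 3=edx) of a CPUID leaf. */
        static void
        msrproto_cpuid_req(uint32_t leaf, uint32_t reg)
        {
                uint64_t req;

                req = ((uint64_t)leaf << 32) |          /* CPUID function */
                    ((uint64_t)(reg & 0x3) << 30) |     /* requested register */
                    MSRPROTO_CPUID_REQ;
                wrmsr(MSR_SEV_GHCB, req);
                vmgexit();
                /* the response is then read back with rdmsr(MSR_SEV_GHCB) */
        }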

> > ...
> > > @@ -1314,6 +1317,38 @@ cpu_init_idt(void)
> > >  	lidt(&region);
> > >  }
> > >
> > > +uint64_t early_gdt[GDT_SIZE / 8];
> > > +
> > > +void
> > > +cpu_init_early_vctrap(paddr_t addr)
> > > +{
> > > +	struct region_descriptor region;
> > > +
> > > +	extern struct region_descriptor gdt64;
> > > +	extern struct gate_descriptor early_idt[NIDT];
> > > +	extern void Xvctrap_early(void);
> > > +
> > > +	/* Setup temporary "early" longmode GDT, will be reset soon */
> > > +	memset(early_gdt, 0, sizeof(early_gdt));
> > > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GCODE_SEL), 0, 0xfffff,
> > > +	    SDT_MEMERA, SEL_KPL, 1, 0, 1);
> > > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GDATA_SEL), 0, 0xfffff,
> > > +	    SDT_MEMRWA, SEL_KPL, 1, 0, 1);
> > > +	setregion(&region, early_gdt, GDT_SIZE - 1);
> > > +	lgdt(&region);
> > > +
> > > +	/* Setup temporary "early" longmode #VC entry, will be reset soon */
> > > +	setgate(&early_idt[T_VC], Xvctrap_early, 0, SDT_SYS386IGT,
> > > +	    SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
> > > +	setregion(&region, early_idt, NIDT * sizeof(idt[0]) - 1);
> > > +	lidt(&region);
> > > +
> > > +	/* Tell vmm(4) about our GHCB. */
> > > +	ghcb_paddr = addr;
> > > +	memset((void *)ghcb_vaddr, 0, 2 * PAGE_SIZE);
> > > +	wrmsr(MSR_SEV_GHCB, ghcb_paddr);
> > > +}
> > > +
> > >  void
> > >  cpu_init_extents(void)
> > >  {
> > > @@ -1433,6 +1468,13 @@ init_x86_64(paddr_t first_avail)
> > >  	bios_memmap_t *bmp;
> > >  	int x, ist;
> > >  	uint64_t max_dm_size = ((uint64_t)512 * NUM_L4_SLOT_DIRECT) << 30;
> > > +
> > > +	/*
> > > +	 * locore0 mapped 2 pages for use as GHCB before pmap is initialized.
> > > +	 */
> > > +	if (ISSET(cpu_sev_guestmode, SEV_STAT_ES_ENABLED))
> > > +		cpu_init_early_vctrap(first_avail);
> > > +	first_avail += 2 * NBPG;
> >
> > Wasn't 'first_avail' already biased by the subq instructions in
> > locore0 though?
> >
> > Why do we need to adjust again?
>
> Good point.  Maybe I misunderstood the existing code regarding
> early_pte_pages:  For "early pte" locore0 allocates/reserves 3 pages,
> I add another 2 for "early vc trap".  So first_avail is the address of
> the lowest of these 5 pages.  And when actually "consuming" them by
> providing them to cpu_init_early_vctrap() with first_avail I adjust by
> 2*NBPG.
>
> Similar to:
>
> +	/*
> +	 * locore0 mapped 3 pages for use before the pmap is initialized
> +	 * starting at first_avail.  These pages are currently used by
> +	 * efifb to create early-use VAs for the framebuffer before efifb
> +	 * is attached.
> +	 */
> +	early_pte_pages = first_avail;
> +	first_avail += 3 * NBPG;
>
> So following users of first_avail will not touch these 5 pages.  Does
> this make sense?

ok

> >
> > >  	/*
> > >  	 * locore0 mapped 3 pages for use before the pmap is initialized
> > ...
> > > +int
> > > +vctrap(struct trapframe *frame, int user)
> > > +{
> > > +	uint64_t sw_exitcode, sw_exitinfo1, sw_exitinfo2;
> > > +	uint8_t *rip = (uint8_t *)(frame->tf_rip);
> > > +	uint16_t port;
> > > +	struct ghcb_sync syncout, syncin;
> > > +	struct ghcb_sa *ghcb;
> > > +
> > > +	intr_disable();
> > > +
> > > +	memset(&syncout, 0, sizeof(syncout));
> > > +	memset(&syncin, 0, sizeof(syncin));
> > > +
> > > +	sw_exitcode = frame->tf_err;
> > > +	sw_exitinfo1 = 0;
> > > +	sw_exitinfo2 = 0;
> > > +
> > > +	switch (sw_exitcode) {
> > > +	case SVM_VMEXIT_CPUID:
> > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncout);
> > > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncout);
> > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncin);
> > > +		ghcb_sync_val(GHCB_RBX, GHCB_SZ32, &syncin);
> > > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncin);
> > > +		ghcb_sync_val(GHCB_RDX, GHCB_SZ32, &syncin);
> > > +		frame->tf_rip += 2;
> > > +		break;
> > > +	case SVM_VMEXIT_MSR: {
> > > +		if (user)
> > > +			return 0;	/* not allowed from userspace */
> > > +		if (*rip == 0x0f && *(rip + 1) == 0x30) {
> >
> > Doesn't this break with XO kernels?
>
> Good point!  I have not considered XO, yet.  Mh, worst case this would
> mean we can not use SEV-(ES|SNP) together with PKU support?  Or would
> there be a way to temporarily remap code readable and then remap XO?

It was just a thought. XO kernel will require much more to be done;
this would just add one more thing to the list. No action needed.

> > ...
> > > +	case SVM_VMEXIT_IOIO: {
> > > +		if (user)
> > > +			return 0;	/* not allowed from userspace */
> > > +		switch (*rip) {
> > > +		case 0x66: {
> > > +			switch (*(rip + 1)) {
> > > +			case 0xef:	/* out %ax,(%dx) */
> > > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncout);
> > > +				port = (uint16_t)frame->tf_rdx;
> > > +				sw_exitinfo1 = (port << 16) |
> > > +				    (1ULL << 5);
> > > +				frame->tf_rip += 2;
> > > +				break;
> > > +			case 0xed:	/* in (%dx),%ax */
> > > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncin);
> > > +				port = (uint16_t)frame->tf_rdx;
> > > +				sw_exitinfo1 = (port << 16) |
> > > +				    (1ULL << 5) | (1ULL << 0);
> > > +				frame->tf_rip += 2;
> > > +				break;
> > > +			default:
> > > +				panic("failed to decode prefixed IOIO");
> > > +			}
> > > +			break;
> > > +		}
> > > +		case 0xe4:	/* in $0x71,%al */
> >
> > I don't understand this assumption. next byte 0xe4 does not imply
> > "in $0x71, %al"; what is meant by this comment? (also the next one).
>
> I am looking at the current byte at RIP, not the next one.  Code like
> this:
>
>     ffffffff81903a76:	e4 71	in     $0x71,%al
>
> Does this make sense?

My point was that "0xe4" just means "in $xxx,%al"; it doesn't mean
"$0x71". Look at the previous examples (0xef, 0xed); I'd recommend
making the comment in this case more generic, something like the
sketch after the quoted code below.

> > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
> > > +		port = *(rip + 1);
> > > +		sw_exitinfo1 = (port << 16) | (1ULL << 4) |
> > > +		    (1ULL << 0);
> > > +		frame->tf_rip += 2;
> > > +		break;
> > ...
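
Something like this, i.e. only the comment changes, making clear the
port comes from the imm8 operand rather than being a fixed value:

        case 0xe4:      /* in $imm8,%al */
                ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
                port = *(rip + 1);      /* imm8 operand */
                sw_exitinfo1 = (port << 16) | (1ULL << 4) |
                    (1ULL << 0);
                frame->tf_rip += 2;
                break;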

> > ...
> > > Index: sys/arch/amd64/amd64/vmm_machdep.c
> > > ===================================================================
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/vmm_machdep.c,v
> > > diff -u -p -r1.57 vmm_machdep.c
> > > --- sys/arch/amd64/amd64/vmm_machdep.c	3 Jun 2025 19:15:29 -0000	1.57
> > > +++ sys/arch/amd64/amd64/vmm_machdep.c	15 Jun 2025 11:03:02 -0000
> > > @@ -1588,15 +1588,15 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> > >  	    SVM_INTERCEPT_MWAIT_UNCOND | SVM_INTERCEPT_MONITOR |
> > >  	    SVM_INTERCEPT_MWAIT_COND | SVM_INTERCEPT_RDTSCP;
> > >
> > > -	/* With SEV-ES we cannot force access XCR0, thus no intercept */
> > > -	if (xsave_mask && !vcpu->vc_seves)
> > > +	if (xsave_mask && !vcpu->vc_seves)	/* XXX hshoexer */
> >
> > What does this XXX mean, and can we fix before commit?
>
> the XXX is already gone.  I think at that time I wanted to remind
> myself to check if my assumption about XCR0 is right.
>
> > ...
> > > @@ -1676,13 +1670,23 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> > >
> > >  	/* Set VMSA. */
> > >  	vmcb->v_vmsa_pa = vcpu->vc_svm_vmsa_pa;
> > > +
> > > +	/* XXX hshoexer: LBR: guest_state_protected flag? */
> >
> > Can we fix this XXX before commit please?
>
> yes, already done.
>
> > > @@ -4608,6 +4623,8 @@ svm_handle_efercr(struct vcpu *vcpu, uin
> > >  	case SVM_VMEXIT_CR4_WRITE_TRAP:
> > >  		vmcb->v_cr4 = vmcb->v_exitinfo1;
> > >  		break;
> > > +	/* XXX hshoexer: no state for CR8? */
> >
> > Can we fix this XXX please before commit?
>
> yes, already done.
>
> > > +		break;
> > >  	default:
> > >  		return (EINVAL);
> > >  	}
> > > @@ -6767,6 +6784,8 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
> > >  	 * On exit, interrupts are disabled, and we are running with
> > >  	 * the guest FPU state still possibly on the CPU.  Save the FPU
> > >  	 * state before re-enabling interrupts.
> > > +	 *
> > > +	 * XXX hshoexer: With SEV-ES we should be able to skip this.
> >
> > Can we fix this XXX please before commit?
>
> yes, already done.
>
> > ...
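
(For the XCR0 one I'd assume the fix is simply restoring the comment
the diff removed, i.e.:

        /* With SEV-ES we cannot force access XCR0, thus no intercept */
        if (xsave_mask && !vcpu->vc_seves)

which reads fine to me.)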